Chapter 2: Describing Contingency Tables - I


Chapter 2: Describing Contingency Tables - I
Dipankar Bandyopadhyay, Department of Biostatistics, Virginia Commonwealth University
BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu]
D. Bandyopadhyay (VCU) 1 / 26

2.1.1 Contingency Tables

Let X and Y be categorical variables measured on a subject, with I and J levels respectively. Each subject sampled has an associated (X,Y); e.g. (X,Y) = (female, Republican). For the gender variable X, I = 2, and for political affiliation Y, we might have J = 3.

Say n individuals are sampled and cross-classified according to their outcome (X,Y). A contingency table places the raw number of subjects falling into each cross-classification category into the table cells. We call such a table an I × J table.

If we relabel the category outcomes to be integers 1 ≤ X ≤ I and 1 ≤ Y ≤ J (i.e. turn our experimental outcomes into random variables), we can simplify notation. In the abstract, a contingency table looks like:

n_ij     Y = 1   Y = 2   ...   Y = J   Totals
X = 1    n_11    n_12    ...   n_1J    n_1+
X = 2    n_21    n_22    ...   n_2J    n_2+
...
X = I    n_I1    n_I2    ...   n_IJ    n_I+
Totals   n_+1    n_+2    ...   n_+J    n = n_++

If subjects are randomly sampled from the population and cross-classified, both X and Y are random and (X,Y) has a bivariate discrete joint distribution. Let π_ij = P(X = i, Y = j), the probability of falling into the (i,j)th (row, column) cell of the table.

2.1.2 Joint/Marginal/Conditional Distributions

From Chapter 2 in Christensen (1997) we have a sample of n = 52 males aged 11 to 30 years with knee operations via arthroscopic surgery. They are cross-classified according to X = 1, 2, 3 for injury type (twisted knee, direct blow, or both) and Y = 1, 2, 3 for surgical result (excellent, good, or fair-to-poor).

n_ij           Excellent   Good   Fair to poor   Totals
Twisted knee   21          11     4              36
Direct blow    3           2      2              7
Both types     7           1      1              9
Totals         31          14     7              n = 52

with theoretical joint probabilities:

π_ij           Excellent   Good   Fair to poor   Totals
Twisted knee   π_11        π_12   π_13           π_1+
Direct blow    π_21        π_22   π_23           π_2+
Both types     π_31        π_32   π_33           π_3+
Totals         π_+1        π_+2   π_+3           π_++ = 1

The marginal probabilities that X = i or Y = j are

P(X = i) = Σ_{j=1}^J P(X = i, Y = j) = Σ_{j=1}^J π_ij = π_i+,

P(Y = j) = Σ_{i=1}^I P(X = i, Y = j) = Σ_{i=1}^I π_ij = π_+j.

A "+" in place of a subscript denotes a sum of all elements over that subscript. We must have

π_++ = Σ_{i=1}^I Σ_{j=1}^J π_ij = 1.

The counts have a multinomial distribution n ~ mult(n, π), where n = [n_ij]_{I×J} and π = [π_ij]_{I×J}. What is (n_1+, ..., n_I+) distributed?
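The marginal sums above are easy to check numerically. A minimal NumPy sketch using the knee-surgery counts from the next section (sample proportions p_ij = n_ij/n stand in for the theoretical π_ij):

```python
import numpy as np

# Knee-surgery counts (Christensen, 1997): rows = twisted knee, direct blow,
# both types; cols = excellent, good, fair-to-poor
n = np.array([[21, 11, 4],
              [ 3,  2, 2],
              [ 7,  1, 1]])

p = n / n.sum()            # sample joint proportions p_ij = n_ij / n
p_row = p.sum(axis=1)      # estimates of pi_{i+} = P(X = i)
p_col = p.sum(axis=0)      # estimates of pi_{+j} = P(Y = j)

print(p_row)               # row margins: 36/52, 7/52, 9/52
print(p_col)               # column margins: 31/52, 14/52, 7/52
print(p.sum())             # pi_{++} must equal 1
```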

Often the marginal counts for X or Y are fixed by design. For example, in a case-control study, a fixed number of cases (e.g. people w/ lung cancer) and a fixed number of controls (no lung cancer) are sampled. Then a risk factor or exposure Y is compared among cases and controls within the table. This results in a separate multinomial distribution for each level of X. Another example is a clinical trial, where the number receiving treatment A and the number receiving treatment B are both fixed.

For the product of I multinomial distributions, the conditional probabilities of falling into Y = j must sum to one for each level of X = i:

Σ_{j=1}^J π_{j|i}^{Y|X} = 1 for i = 1, ..., I.

The notation gets out of hand. We will simplify π_{j|i}^{Y|X} to just π_{j|i} or perhaps π_ij depending on the situation, and make clear how the data are sampled.

The following 2 × 3 contingency table is from a report by the Physicians' Health Study Research Group on n = 22,071 physicians who took either a placebo or aspirin every other day.

           Fatal attack   Nonfatal attack   No attack
Placebo    18             171               10,845
Aspirin    5              99                10,933

Here we have placed the conditional probabilities of each classification into each cell:

           Fatal attack   Nonfatal attack   No attack
Placebo    π_{1|1}        π_{2|1}           π_{3|1}
Aspirin    π_{1|2}        π_{2|2}           π_{3|2}

The row totals n_1+ = 11,034 and n_2+ = 11,037 are fixed, and thus

π_{1|1} + π_{2|1} + π_{3|1} = 1 and π_{1|2} + π_{2|2} + π_{3|2} = 1.

We want to compare probabilities in each column.
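Since the row totals are fixed, the natural estimates are the within-row proportions π̂_{j|i} = n_ij/n_i+. A short NumPy sketch for the aspirin table:

```python
import numpy as np

# Physicians' Health Study counts: rows = placebo, aspirin;
# cols = fatal attack, nonfatal attack, no attack
n = np.array([[18, 171, 10845],
              [ 5,  99, 10933]])

# Row totals are fixed by design, so estimate the conditional
# probabilities pi_{j|i} = n_ij / n_{i+} within each row.
cond = n / n.sum(axis=1, keepdims=True)
print(cond.round(5))                 # each row sums to 1
```

Comparing the two rows column by column (e.g. fatal attacks: 18/11,034 vs 5/11,037) is exactly the comparison the slide calls for.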

2.1.3 Sensitivity and Specificity for Medical Diagnoses

Diagnostic tests indicate the presence or absence of a disease or infection. Tests are typically imperfect; i.e. there is positive probability of incorrectly diagnosing a subject as not infected when they are in fact infected, and vice-versa.

Let D+ or D- be the true infection/disease status and T+ or T- be the result of a diagnostic test.

Sensitivity = P(T+ | D+).
Specificity = P(T- | D-).

Let π_11 = P(T+, D+), π_12 = P(T+, D-), π_21 = P(T-, D+), π_22 = P(T-, D-). Let subjects be randomly sampled from the population so that π_11 + π_12 + π_21 + π_22 = 1. Then sensitivity is given by

S_e = P(T+ | D+) = P(T+, D+)/P(D+) = π_11/π_+1,

and specificity by

S_p = P(T- | D-) = P(T-, D-)/P(D-) = π_22/π_+2.

To get MLEs for sensitivity and specificity, simply replace each π_ij by its MLE π̂_ij = n_ij/n, where n_ij is the number falling into category (i,j) and n = n_++. If n_1 = n_+1 and n_0 = n_+2 are fixed ahead of time, we have product multinomial sampling. The MLEs are exactly the same.

Example: Strep test

Sheeler et al. (2002) describe a modest prospective trial of n = 232 individuals complaining of sore throat who were given the rapid strep (streptococcal pharyngitis) test T. The true status of each individual, D, was determined by throat culture. The 2 × 2 contingency table looks like:

Table: Strep Test Results
        D+    D-    Total
T+      44    4     48
T-      19    165   184
Total   63    169   232

Using the strep test table above:

An estimate of S_e is Ŝ_e = P̂(T+ | D+) = 44/63 = 0.70.

An estimate of S_p is Ŝ_p = P̂(T- | D-) = 165/169 = 0.98.

The estimated prevalence of strep among those complaining of sore throat, P(D+), is P̂(D+) = 63/232 = 0.27.
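These three estimates can be computed directly from the cell counts; a minimal sketch:

```python
# MLEs of sensitivity, specificity, and prevalence from the strep table
n11, n12 = 44, 4      # (T+, D+) and (T+, D-)
n21, n22 = 19, 165    # (T-, D+) and (T-, D-)
n = n11 + n12 + n21 + n22          # 232

se_hat = n11 / (n11 + n21)         # P(T+ | D+) = 44/63
sp_hat = n22 / (n12 + n22)         # P(T- | D-) = 165/169
prev_hat = (n11 + n21) / n         # P(D+) = 63/232

print(round(se_hat, 2), round(sp_hat, 2), round(prev_hat, 2))  # 0.7 0.98 0.27
```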

2.1.4 Independence of Categorical Variables

When (X,Y) are jointly distributed, X and Y are independent if

P(X = i, Y = j) = P(X = i)P(Y = j), or π_ij = π_i+ π_+j.

Let

π_{i|j}^{X|Y} = P(X = i | Y = j) = π_ij/π_+j and π_{j|i}^{Y|X} = P(Y = j | X = i) = π_ij/π_i+.

Then independence of X and Y implies

P(X = i | Y = j) = P(X = i) and P(Y = j | X = i) = P(Y = j).

The probability of any given column response is the same for each row. The probability of any given row response is the same for each column.
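To see the "same row distribution in every row" property concretely, we can build a joint table satisfying π_ij = π_i+ π_+j from hypothetical marginals (the marginal values below are illustrative, not from the text) and check that every conditional row distribution equals the column marginal:

```python
import numpy as np

# Hypothetical marginals (assumed for illustration)
pi_row = np.array([0.5, 0.3, 0.2])      # P(X = i)
pi_col = np.array([0.6, 0.3, 0.1])      # P(Y = j)

# Under independence, pi_ij = pi_{i+} * pi_{+j}
pi = np.outer(pi_row, pi_col)

cond = pi / pi.sum(axis=1, keepdims=True)   # pi_{j|i} for each row i
print(np.allclose(cond, pi_col))            # True: every row equals P(Y = j)
```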

2.1.5 Poisson, Binomial, Multinomial Sampling

Let n_ij be the cell count in the (i,j)th classification. When the sample size n = n_++ is random, Poisson sampling assumes n_ij ~ ind. Poisson(μ_ij). Then

p(n | μ) = L(μ) = ∏_{i=1}^I ∏_{j=1}^J e^{-μ_ij} μ_ij^{n_ij} / n_ij!.

When n = n_++ is fixed but the row totals n_i+ and column totals n_+j are not, we have multinomial sampling and

p(n | π) = L(π) = n! ∏_{i=1}^I ∏_{j=1}^J π_ij^{n_ij} / n_ij!.

Finally, sometimes row (or column) totals are fixed ahead of time (e.g. sampling n_1+ women and n_2+ men and asking them if they smoke). Then we have product multinomial sampling. Agresti prefers using n_i = n_i+ for simplicity.

For a fixed X = i, there are J counts (n_i1, n_i2, ..., n_iJ) adding to n_i+, and this vector is multinomial. Since there are I values of the covariate X, we have I independent multinomial distributions, or the product of I mult(n_i, π_i) distributions, where π_i = (π_{1|i}, ..., π_{J|i}):

p(n | π) = L(π) = ∏_{i=1}^I [ n_i! ∏_{j=1}^J π_{j|i}^{n_ij} / n_ij! ].

Note: under product multinomial sampling, only the conditional probabilities π_{j|i} can be estimated. To estimate π_ij requires information on the π_i+ occurring naturally.

2.1.6 Seat Belt Example

The Mass. Hwy. Dept. wants to study seat belt use Y (yes, no) and fatality X (fatal, not fatal) of crashes on the Mass. Turnpike.

- Could just analyze data as they arise naturally. Then n_ij ~ ind. Pois(μ_ij): Poisson sampling.
- If n = 200 police records are sampled from crashes on the turnpike, then (n_11, n_12, n_21, n_22) ~ mult(200, π): multinomial sampling.
- Could sample n_1 = 100 fatal crash reports and n_2 = 100 nonfatal reports. Then (n_11, n_12) ~ mult(100, (π_{1|1}, π_{2|1})) independent of (n_21, n_22) ~ mult(100, (π_{1|2}, π_{2|2})): product multinomial sampling. Here, there's no information on the prevalence of fatal versus non-fatal accidents.

Read the experimental design approach.
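The three sampling schemes can be simulated side by side. A sketch with a hypothetical joint probability table (the π values below are made up for illustration, not from the seat belt data):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([[0.10, 0.20],    # hypothetical joint probs: rows = fatal/not,
               [0.05, 0.65]])   # cols = belt yes/no (illustration only)

# Poisson sampling: every cell count independent Poisson, total n random
mu = 200 * pi
poisson_tab = rng.poisson(mu)

# Multinomial sampling: grand total n = 200 fixed, cells mult(200, pi)
mult_tab = rng.multinomial(200, pi.ravel()).reshape(2, 2)

# Product multinomial: each row total fixed at 100, rows sampled separately
cond = pi / pi.sum(axis=1, keepdims=True)          # pi_{j|i}
prod_tab = np.vstack([rng.multinomial(100, cond[i]) for i in range(2)])

print(mult_tab.sum())            # always 200
print(prod_tab.sum(axis=1))      # always [100 100]
```

Note that only the product multinomial scheme throws away the row margin: `prod_tab` carries no information about how common fatal crashes are.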

2.1.8 Types of Studies

              Case   Control
Smoker        688    650
Non-smoker    21     59
Total         709    709

In a case/control study, fixed numbers of cases n_1 and controls n_2 are (randomly) selected and exposure variables of interest recorded. In the above study we can compare the relative proportions of those who smoke within those that developed lung cancer (cases) and those that did not (controls). We can measure association between smoking and lung cancer, but cannot infer causation. These data were collected after the fact. Such data are usually cheap and easy to get. Above: lung cancer (p. 42).

Prospective studies start with a sample and observe it through time.

- A clinical trial randomly allocates smoking and non-smoking treatments to experimental units and then sees who ends up with lung cancer or not. There is an obvious problem with ethics here.
- A cohort study simply follows subjects after letting them assign their own treatments (i.e. smoking or non-smoking) and records outcomes.

A cross-sectional design samples n subjects from a population and cross-classifies them.

Carefully read this section. Classify each study as multinomial or product multinomial.

2.2 Comparing Two Proportions
2.2.1 & 2.2.2 Difference of Proportions & Relative Risk

Let X and Y be dichotomous. Let π_1 = P(Y = 1 | X = 1) and π_2 = P(Y = 1 | X = 2).

The difference in the probability of Y = 1 when X = 1 versus X = 2 is π_1 - π_2.

The relative risk π_1/π_2 may be more informative for rare outcomes. However, it may also exaggerate the effect of X = 1 versus X = 2 and cloud the issues.

2.2 Comparing Two Proportions

Example: Let Y = 1 indicate presence of a disease and X = 1 indicate an exposure.

When π_2 = 0.001 and π_1 = 0.01, π_1 - π_2 = 0.009, but π_1/π_2 = 10. You are 10 times more likely to get the disease when X = 1 than when X = 2. However, in either case the probability of getting the disease is at most 0.01.

When π_2 = 0.401 and π_1 = 0.41, π_1 - π_2 = 0.009 again, but now π_1/π_2 = 1.02. You are only 2% more likely to get the disease when X = 1 than when X = 2. This doesn't seem as drastic as 1000%.

These sorts of comparisons figure into reporting results concerning public health and safety information, e.g. hormone therapy for post-menopausal women, the relative safety of SUVs versus sedans, etc.
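The two scenarios above share the same risk difference but wildly different relative risks; a two-line check:

```python
# Same difference of proportions, very different relative risks
for pi1, pi2 in [(0.010, 0.001), (0.410, 0.401)]:
    diff = pi1 - pi2
    rr = pi1 / pi2
    print(f"diff = {diff:.3f}, relative risk = {rr:.2f}")
```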

2.2 Comparing Two Proportions
2.2.3 & 2.2.4 Odds Ratios

The odds of success (say Y = 1) versus failure (Y = 2) are Ω = π/(1 - π), where π = P(Y = 1). When someone says "3 to 1 odds the Vikings will win," they mean Ω = 3, which implies the probability the Vikings will win is 0.75, from π = Ω/(Ω + 1). Odds measure the relative rates of success and failure.

An odds ratio compares the relative rates of success (or disease or whatever) across two exposures X = 1 and X = 2:

θ = Ω_1/Ω_2 = [π_1/(1 - π_1)] / [π_2/(1 - π_2)].

Odds ratios are always positive, and a ratio > 1 indicates the relative rate of success for X = 1 is greater than for X = 2. However, the odds ratio gives no information on the probabilities π_1 = P(Y = 1 | X = 1) and π_2 = P(Y = 1 | X = 2).

2.2 Comparing Two Proportions

Different values for these parameters can lead to the same odds ratio.

Example: π_1 = 0.833 and π_2 = 0.5 yield θ = 5.0. So do π_1 = 0.0005 and π_2 = 0.0001. One set of values might imply a different decision than the other, but θ = 5.0 in both cases. Here, the relative risks are about 1.7 and 5, respectively.

Note that when dealing with a rare outcome, where π_i ≈ 0, the relative risk is approximately equal to the odds ratio.
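A quick sketch confirming both claims: the two parameter sets give (approximately) the same odds ratio, and the relative risk approaches the odds ratio only in the rare-outcome case:

```python
# Same odds ratio, very different probabilities and relative risks
def odds_ratio(p1, p2):
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

for p1, p2 in [(0.833, 0.5), (0.0005, 0.0001)]:
    print(f"theta = {odds_ratio(p1, p2):.2f}, RR = {p1 / p2:.2f}")
```

In the second (rare-outcome) pair, 1 - p1 and 1 - p2 are both essentially 1, so θ ≈ π_1/π_2.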

2.2 Comparing Two Proportions

When θ = 1 we must have Ω_1 = Ω_2, which further implies that π_1 = π_2, and hence Y does not depend on the value of X. If (X,Y) are both random, then X and Y are stochastically independent.

An important property of the odds ratio is the following:

θ = [P(Y = 1 | X = 1)/P(Y = 2 | X = 1)] / [P(Y = 1 | X = 2)/P(Y = 2 | X = 2)]
  = [P(X = 1 | Y = 1)/P(X = 2 | Y = 1)] / [P(X = 1 | Y = 2)/P(X = 2 | Y = 2)].

You should verify this formally. It implies that for the purposes of estimating an odds ratio, it does not matter whether data are sampled prospectively, retrospectively, or cross-sectionally. In all cases the odds ratio is estimated by θ̂ = n_11 n_22 / [n_12 n_21].

2.2 Comparing Two Proportions
2.2.6 Case-Control Studies and the Odds Ratio

Recall there are n_1 = n_2 = 709 lung cancer cases and (non-lung cancer) controls in the case-control table of 2.1.8. The margins are fixed and we have product multinomial sampling.

We can estimate π_{1|1} = P(Y = 1 | X = 1) = n_11/n_1+ and π_{1|2} = P(Y = 1 | X = 2) = n_21/n_2+, but not P(X = 1 | Y = 1) or P(X = 1 | Y = 2). However, for the purposes of estimating θ it does not matter!

For the lung cancer case/control data, θ̂ = (688 × 59)/(21 × 650) = 3.0 to one decimal place.

2.2 Comparing Two Proportions

Odds of lung cancer:

- The odds of being a smoker are 3 times greater for those that develop lung cancer than for those that do not.
- The odds of developing lung cancer are 3 times greater for smokers than for non-smokers.

The second interpretation is more relevant when deciding whether or not you should take up recreational smoking. Note that we cannot estimate the relative risk of developing lung cancer for smokers, P(X = 1 | Y = 1)/P(X = 1 | Y = 2).

2.2 Comparing Two Proportions

Recall the MLE of θ is θ̂ = n_11 n_22 / [n_12 n_21] (from either multinomial or product multinomial sampling!). The standard error of θ̂ can be computed as usual from large-sample considerations. However, the distribution of log θ̂ is better approximated by a normal distribution in smaller samples. The standard error of the log-odds ratio is

se{log(θ̂)} = sqrt(1/n_11 + 1/n_12 + 1/n_21 + 1/n_22).

For example, if we want a 95% CI for the odds ratio from a 2 × 2 table, we compute

(exp{log θ̂ - 1.96 se[log(θ̂)]}, exp{log θ̂ + 1.96 se[log(θ̂)]}).

For the smoking data, this yields (1.79, 4.95). We are 95% confident that the true odds ratio is between 1.8 and 5.0.
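The whole calculation for the smoking case-control data fits in a few lines:

```python
import math

# 95% CI for the odds ratio from the smoking case-control table:
# rows = smoker/non-smoker, cols = case/control
n11, n12, n21, n22 = 688, 650, 21, 59

theta_hat = (n11 * n22) / (n12 * n21)
se_log = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
lo = math.exp(math.log(theta_hat) - 1.96 * se_log)
hi = math.exp(math.log(theta_hat) + 1.96 * se_log)
print(round(theta_hat, 2), round(lo, 2), round(hi, 2))  # 2.97 1.79 4.95
```

Note that the small cell counts n_21 = 21 and n_22 = 59 dominate the standard error, since 1/21 and 1/59 are by far the largest terms under the square root.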

2.2 Comparing Two Proportions

Formally comparing groups: you should convince yourself that the following statements are equivalent:

- π_1 - π_2 = 0, the difference in proportions is zero.
- π_1/π_2 = 1, the relative risk is one.
- θ = [π_1/(1 - π_1)]/[π_2/(1 - π_2)] = 1, the odds ratio is one.

All of these imply that there is no difference between groups for the outcome being measured, i.e. Y is independent of X, written Y ⊥ X.