Contingency Tables Part One

STA 312: Fall 2012. See last slide for copyright information.

Suggested Reading: Chapter 2
- Read Sections 2.1-2.4.
- You are not responsible for Section 2.5.

Overview
1. Definitions
2. Study Designs and Models
3. Odds ratio
4. Testing Independence

We are interested in relationships between variables
A contingency table is a joint frequency distribution.

                          No Pneumonia   Pneumonia
No Vitamin C
500 mg. or more Daily

A contingency table
- Counts the number of cases in combinations of two (or more) categorical variables.
- In general, X has I categories and Y has J categories.
- Often, X is the explanatory variable and Y is the response variable (like regression).

Cell probabilities
$\pi_{ij}$, for $i = 1, \ldots, I$ and $j = 1, \ldots, J$

                  Passed the Course
Course        Did not pass   Passed       Total
Catch-up      $\pi_{11}$     $\pi_{12}$   $\pi_{1+}$
Mainstream    $\pi_{21}$     $\pi_{22}$   $\pi_{2+}$
Elite         $\pi_{31}$     $\pi_{32}$   $\pi_{3+}$
Total         $\pi_{+1}$     $\pi_{+2}$   $1$

Marginal probabilities:
\[ \Pr\{X = i\} = \sum_{j=1}^{J} \pi_{ij} = \pi_{i+}, \qquad \Pr\{Y = j\} = \sum_{i=1}^{I} \pi_{ij} = \pi_{+j} \]

Conditional probabilities
\[ \Pr\{Y = j \mid X = i\} = \frac{\Pr\{Y = j, X = i\}}{\Pr\{X = i\}} = \frac{\pi_{ij}}{\pi_{i+}} \]

                  Passed the Course
Course        Did not pass   Passed       Total
Catch-up      $\pi_{11}$     $\pi_{12}$   $\pi_{1+}$
Mainstream    $\pi_{21}$     $\pi_{22}$   $\pi_{2+}$
Elite         $\pi_{31}$     $\pi_{32}$   $\pi_{3+}$
Total         $\pi_{+1}$     $\pi_{+2}$   $1$

- Usually, interest is in the conditional distribution of the response variable given the explanatory variable.
- Sometimes, we make tables of conditional probabilities.

Cell frequencies

                  Passed the Course
Course        Did not pass   Passed       Total
Catch-up      $n_{11}$       $n_{12}$     $n_{1+}$
Mainstream    $n_{21}$       $n_{22}$     $n_{2+}$
Elite         $n_{31}$       $n_{32}$     $n_{3+}$
Total         $n_{+1}$       $n_{+2}$     $n$

For example

                  Passed the Course
Course        Did not pass   Passed   Total
Catch-up            27           8      35
Mainstream         124         204     328
Elite                7          24      31
Total              158         236     394

Estimating probabilities
- Should we just estimate $\pi_{ij}$ with $p_{ij} = n_{ij}/n$?
- Sometimes. It depends on the study design.
- The study design determines exactly what is in the tables.
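As a rough illustration of the estimates mentioned here, the sketch below computes $p_{ij} = n_{ij}/n$ and the row-conditional proportions $n_{ij}/n_{i+}$ from the course example. The variable names are illustrative choices rather than anything from the slides, and whether these proportions are sensible estimates still depends on the study design.

```python
# Observed counts from the course example.
# Rows are courses; columns are (Did not pass, Passed).
counts = {
    "Catch-up": [27, 8],
    "Mainstream": [124, 204],
    "Elite": [7, 24],
}

# Total sample size n
n = sum(sum(row) for row in counts.values())

# Joint proportions p_ij = n_ij / n
joint = {course: [n_ij / n for n_ij in row] for course, row in counts.items()}

# Row-conditional proportions n_ij / n_i+
conditional = {
    course: [n_ij / sum(row) for n_ij in row] for course, row in counts.items()
}

for course in counts:
    print(
        course,
        [round(p, 3) for p in joint[course]],
        [round(p, 3) for p in conditional[course]],
    )
```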

Study designs
- Cross-sectional
- Prospective
- Retrospective

Cross-sectional design
- Both variables in the table are measured, with
  - No assignment of cases to experimental conditions
  - No selection of cases based on variable values
- For example, a sample of n first-year university students sign up for one of three calculus courses, and each student either passes the course or does not.
- Total sample size n is fixed by the design.
- Multinomial model, with c = IJ categories.
- Estimate $\pi_{ij}$ with $p_{ij}$.
- Estimating conditional probabilities is easy.

Prospective design
- Prospective means looking forward (from explanatory to response).
- Groups that define the explanatory variable categories are formed before the response variable is observed.
- Experimental studies with random assignment are prospective (clinical trials).
- Cohort studies that follow patients who got different types of surgery.
- Stratified sampling, such as interviewing 200 people from each province.
- Marginal totals of the explanatory variable are fixed by the design.
- Assume random sampling within each category defined by the explanatory variable, and independence between categories.
- Product multinomial model: A product of I multinomial likelihoods.
- Good for estimating the conditional probability of the response given a value of the explanatory variable.

Product multinomial
- Take independent random samples of sizes $n_{1+}, \ldots, n_{I+}$ from I sub-populations.
- In each, observe a multinomial with J categories. Compare.
- Example: Sample 100 entering students from each of three campuses. At the end of first year, observe whether they are in good standing, on probation, or have left the university.
- The $\pi_{ij}$ are now conditional probabilities: $\pi_{i+} = 1$ for each $i$.
- Write the likelihood as
\[ \prod_{i=1}^{3} \left[ \pi_{i1}^{n_{i1}} \, \pi_{i2}^{n_{i2}} \, (1 - \pi_{i1} - \pi_{i2})^{n_{i3}} \right] \]
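The product multinomial likelihood can be sketched numerically as follows. The counts are hypothetical (invented for illustration, not data from the slides), the function names are assumptions, and the multinomial coefficients are omitted because they do not depend on the $\pi_{ij}$.

```python
import math

def product_multinomial_loglik(counts, probs):
    """Sum of n_ij * log(pi_ij) over all cells (constant term omitted)."""
    return sum(
        n_ij * math.log(p_ij)
        for row_counts, row_probs in zip(counts, probs)
        for n_ij, p_ij in zip(row_counts, row_probs)
        if n_ij > 0  # a zero count contributes nothing to the sum
    )

def row_mle(counts):
    """Within-row proportions n_ij / n_i+, which maximize the likelihood."""
    return [[n_ij / sum(row) for n_ij in row] for row in counts]

# Hypothetical counts: 100 students sampled from each of three campuses;
# columns are (good standing, on probation, left the university).
counts = [
    [70, 20, 10],
    [60, 25, 15],
    [80, 15, 5],
]

mle = row_mle(counts)
uniform = [[1 / 3] * 3 for _ in counts]  # an arbitrary comparison point

loglik_mle = product_multinomial_loglik(counts, mle)
loglik_uniform = product_multinomial_loglik(counts, uniform)
print(loglik_mle, loglik_uniform)
```

Each row of `mle` sums to one, matching the interpretation of the $\pi_{ij}$ as conditional probabilities within each sub-population.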

Retrospective design
- Retrospective means looking backward (from response to explanatory).
- In a case-control study, a sample of patients with a disease is compared to a sample without the disease, to discover variables that might have caused the disease.
- Vitamin C and Pneumonia (fairly rare, even in the elderly).
- Marginal totals for the response variable are fixed by the design.
- Product multinomial again.
- Natural for estimating the conditional probability of the explanatory variable given the response variable.
- Usually that's not what you want.
- But if you know the probability of having the disease, you can use Bayes' Theorem to estimate the conditional probabilities in the more interesting direction.
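A minimal sketch of the Bayes' Theorem step, assuming a binary exposure and a known disease probability. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def prob_disease_given_exposure(p_exp_given_case, p_exp_given_control, prevalence):
    """Bayes' Theorem: P(disease | exposed) from the retrospective quantities.

    p_exp_given_case    -- P(exposed | disease), estimable from the cases
    p_exp_given_control -- P(exposed | no disease), estimable from the controls
    prevalence          -- P(disease), which must come from outside the study
    """
    numerator = p_exp_given_case * prevalence
    denominator = numerator + p_exp_given_control * (1.0 - prevalence)
    return numerator / denominator

# Hypothetical values: 20% of cases and 40% of controls were exposed,
# and the disease has probability 0.01 in the population.
result = prob_disease_given_exposure(0.2, 0.4, 0.01)
print(round(result, 4))
```

When the exposure rate is the same among cases and controls, the function returns the prevalence unchanged, which agrees with the idea that the variables are then unrelated.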

Meanings of X and Y "unrelated"
- The conditional distribution of $Y \mid X = x$ is the same for every $x$.
- The conditional distribution of $X \mid Y = y$ is the same for every $y$.
- X and Y are independent (if both are random).
If variables are not unrelated, call them "related."

Put probabilities in table cells

          Y = 1                     Y = 2                     Total
X = 1     $\pi_{11}$                $\pi_{12}$                $\pi_{11} + \pi_{12}$
X = 2     $\pi_{21}$                $\pi_{22}$                $\pi_{21} + \pi_{22}$
Total     $\pi_{11} + \pi_{21}$     $\pi_{12} + \pi_{22}$

\[ \Pr\{Y = 1 \mid X = 1\} = \frac{\pi_{11}}{\pi_{11} + \pi_{12}} \]

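Conditional distribution of Y given X = x
Same for all values of x

          Y = 1                     Y = 2                     Total
X = 1     $\pi_{11}$                $\pi_{12}$                $\pi_{11} + \pi_{12}$
X = 2     $\pi_{21}$                $\pi_{22}$                $\pi_{21} + \pi_{22}$
Total     $\pi_{11} + \pi_{21}$     $\pi_{12} + \pi_{22}$

\begin{align*}
\Pr\{Y = 1 \mid X = 1\} &= \Pr\{Y = 1 \mid X = 2\} \\
\Leftrightarrow \quad \frac{\pi_{11}}{\pi_{11} + \pi_{12}} &= \frac{\pi_{21}}{\pi_{21} + \pi_{22}} \\
\Leftrightarrow \quad \pi_{11}(\pi_{21} + \pi_{22}) &= \pi_{21}(\pi_{11} + \pi_{12}) \\
\Leftrightarrow \quad \pi_{11}\pi_{21} + \pi_{11}\pi_{22} &= \pi_{11}\pi_{21} + \pi_{12}\pi_{21} \\
\Leftrightarrow \quad \pi_{11}\pi_{22} &= \pi_{12}\pi_{21} \\
\Leftrightarrow \quad \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} &= \theta = 1
\end{align*}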

Cross product ratio

          Y = 1                     Y = 2                     Total
X = 1     $\pi_{11}$                $\pi_{12}$                $\pi_{11} + \pi_{12}$
X = 2     $\pi_{21}$                $\pi_{22}$                $\pi_{21} + \pi_{22}$
Total     $\pi_{11} + \pi_{21}$     $\pi_{12} + \pi_{22}$

\[ \theta = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} \]

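Conditional distribution of X given Y = y
Same for all values of y

          Y = 1                     Y = 2                     Total
X = 1     $\pi_{11}$                $\pi_{12}$                $\pi_{11} + \pi_{12}$
X = 2     $\pi_{21}$                $\pi_{22}$                $\pi_{21} + \pi_{22}$
Total     $\pi_{11} + \pi_{21}$     $\pi_{12} + \pi_{22}$

\begin{align*}
\Pr\{X = 1 \mid Y = 1\} &= \Pr\{X = 1 \mid Y = 2\} \\
\Leftrightarrow \quad \frac{\pi_{11}}{\pi_{11} + \pi_{21}} &= \frac{\pi_{12}}{\pi_{12} + \pi_{22}} \\
\Leftrightarrow \quad \pi_{11}(\pi_{12} + \pi_{22}) &= \pi_{12}(\pi_{11} + \pi_{21}) \\
\Leftrightarrow \quad \pi_{11}\pi_{12} + \pi_{11}\pi_{22} &= \pi_{11}\pi_{12} + \pi_{12}\pi_{21} \\
\Leftrightarrow \quad \pi_{11}\pi_{22} &= \pi_{12}\pi_{21} \\
\Leftrightarrow \quad \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} &= \theta = 1
\end{align*}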

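X and Y independent
Meaningful in a cross-sectional design. Write the probability table (with marginal totals in the last row and column) as
\[ \pi = \begin{array}{cc|c}
x & a - x & a \\
b - x & 1 - a - b + x & 1 - a \\ \hline
b & 1 - b & 1
\end{array} \]
Independence means $P(X = x, Y = y) = P(X = x)\,P(Y = y)$. If $x = ab$ then
\[ \pi = \begin{array}{cc|c}
ab & a(1 - b) & a \\
b(1 - a) & (1 - a)(1 - b) & 1 - a \\ \hline
b & 1 - b & 1
\end{array} \]
and the cross-product ratio $\theta = 1$.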

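Conversely
\[ \begin{array}{cc|c}
x & a - x & a \\
b - x & 1 - a - b + x & 1 - a \\ \hline
b & 1 - b & 1
\end{array} \]
If $\theta = 1$, then
\begin{align*}
x(1 - a - b + x) &= (a - x)(b - x) \\
\Leftrightarrow \quad x - ax - bx + x^2 &= ab - ax - bx + x^2 \\
\Leftrightarrow \quad x &= ab
\end{align*}
meaning X and Y are independent.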

What we have learned about the cross-product ratio $\theta$
- In a $2 \times 2$ table, $\theta = 1$ if and only if the variables are unrelated, no matter how "unrelated" is expressed:
  - The conditional distribution of $Y \mid X = x$ is the same for every $x$.
  - The conditional distribution of $X \mid Y = y$ is the same for every $y$.
  - X and Y are independent (if both are random).
- It's meaningful for all three study designs: Prospective, Retrospective and Cross-sectional.
- Investigate $\theta$ a bit more.
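The equivalence between $\theta = 1$ and independence can be checked numerically with a small sketch. It uses the $a$, $b$, $x$ parameterization of the $2 \times 2$ probability table from the earlier slides; the function names and the particular values of $a$, $b$ and $x$ are arbitrary illustrative choices.

```python
import math

def two_by_two(a, b, x):
    """Cell probabilities with row-1 margin a, column-1 margin b, upper-left cell x."""
    return [[x, a - x], [b - x, 1 - a - b + x]]

def cross_product_ratio(t):
    """theta = (pi_11 * pi_22) / (pi_12 * pi_21)."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])

a, b = 0.3, 0.6

# Independence: upper-left cell equals the product of its margins.
theta_independent = cross_product_ratio(two_by_two(a, b, a * b))

# Same margins, but x is not equal to a * b, so the variables are related.
theta_related = cross_product_ratio(two_by_two(a, b, 0.25))

print(math.isclose(theta_independent, 1.0), theta_related)
```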

Odds
- Denoting the probability of an event by $\pi$,
\[ \text{Odds} = \frac{\pi}{1 - \pi}. \]
- Implicitly, we are saying the odds are $\frac{\pi}{1 - \pi}$ "to one."
- If the probability of the event is $\pi = 2/3$, then the odds are $\frac{2/3}{1/3} = 2$, or two to one.
- Instead of saying the odds are 5 to 2, we'd say 2.5 to one.
- Instead of saying 1 to 4, we'd say 0.25 to one.
- The higher the probability, the greater the odds.
- As the probability of an event approaches one, the denominator of the odds approaches zero.
- This means the odds can be any non-negative number.
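The conversions on this slide can be written as a pair of small helper functions. This is only a sketch; the function names are illustrative assumptions.

```python
import math

def odds(p):
    """Odds 'to one' for an event with probability p, where 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def probability(o):
    """Inverse conversion: probability of an event whose odds are o to one."""
    if o < 0:
        raise ValueError("odds must be non-negative")
    return o / (1 + o)

print(odds(2 / 3))   # probability 2/3: about two to one
print(odds(5 / 7))   # odds of 5 to 2: about 2.5 to one
print(odds(1 / 5))   # odds of 1 to 4: about 0.25 to one
```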

Odds ratio
- "Conditional odds" is an idea that makes sense.
- Just use a conditional probability to calculate the odds.
- Consider the ratio of the odds of Y = 1 given X = 1 to the odds of Y = 1 given X = 2.
- Could say something like "The odds of cancer are 3.2 times as great for smokers."
\[ \frac{\text{Odds}(Y = 1 \mid X = 1)}{\text{Odds}(Y = 1 \mid X = 2)} = \frac{P(Y = 1 \mid X = 1)}{P(Y = 2 \mid X = 1)} \Bigg/ \frac{P(Y = 1 \mid X = 2)}{P(Y = 2 \mid X = 2)} \]

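Simplify the odds ratio

          Y = 1          Y = 2          Total
X = 1     $\pi_{11}$     $\pi_{12}$     $\pi_{1+}$
X = 2     $\pi_{21}$     $\pi_{22}$     $\pi_{2+}$
Total     $\pi_{+1}$     $\pi_{+2}$     $1$

\begin{align*}
\frac{\text{Odds}(Y = 1 \mid X = 1)}{\text{Odds}(Y = 1 \mid X = 2)}
&= \frac{P(Y = 1 \mid X = 1)}{P(Y = 2 \mid X = 1)} \Bigg/ \frac{P(Y = 1 \mid X = 2)}{P(Y = 2 \mid X = 2)} \\
&= \frac{\pi_{11}/\pi_{1+}}{\pi_{12}/\pi_{1+}} \Bigg/ \frac{\pi_{21}/\pi_{2+}}{\pi_{22}/\pi_{2+}} \\
&= \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \theta
\end{align*}
So the cross-product ratio is actually the odds ratio.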

The cross-product ratio is the odds ratio
- When $\theta = 1$,
  - The odds of Y = 1 given X = 1 equal the odds of Y = 1 given X = 2.
  - This happens if and only if X and Y are unrelated.
  - Applies to all 3 study designs.
- If $\theta > 1$, the odds of Y = 1 given X = 1 are greater than the odds of Y = 1 given X = 2.
- If $\theta < 1$, the odds of Y = 1 given X = 1 are less than the odds of Y = 1 given X = 2.

Odds ratio applies to larger tables

           Admitted   Not Admitted
Dept. A       601         332
Dept. B       370         215
Dept. C       322         596
Dept. D       269         523
Dept. E       147         437
Dept. F        46         668

The (estimated) odds of being accepted are
\[ \hat{\theta} = \frac{(601)(668)}{(332)(46)} = 26.3 \]
times as great in Department A, compared to Department F.
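The calculation for Departments A and F can be reproduced with a short sketch that applies the cross-product ratio to any pair of rows of the admissions table. The dictionary layout and function name are illustrative assumptions.

```python
# (Admitted, Not Admitted) counts by department, from the table above.
admissions = {
    "A": (601, 332),
    "B": (370, 215),
    "C": (322, 596),
    "D": (269, 523),
    "E": (147, 437),
    "F": (46, 668),
}

def odds_ratio(row1, row2):
    """Estimated odds of admission in row1 relative to row2 (cross-product ratio)."""
    return (row1[0] * row2[1]) / (row1[1] * row2[0])

theta_af = odds_ratio(admissions["A"], admissions["F"])
print(round(theta_af, 1))  # 26.3
```

Swapping the two rows gives the reciprocal, which corresponds to re-arranging rows to put a different cell in the upper left position.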

Some things to notice about the odds ratio
- The cross-product (odds) ratio is meaningful for large tables; apply it to $2 \times 2$ sub-tables.
- Re-arrange rows and columns as desired to get the cell you want in the upper left position.
- Combining rows or columns (especially columns) by adding the frequencies is natural and valid.
- If you hear something like "Chances of death before age 50 are four times as great for smokers," most likely they are talking about an odds ratio.

Testing independence with large samples
For cross-sectional data

                  Passed the Course
Course        Did not pass   Passed           Total
Catch-up      $\pi_{11}$     $\pi_{12}$       $\pi_{1+}$
Mainstream    $\pi_{21}$     $\pi_{22}$       $\pi_{2+}$
Elite         $\pi_{31}$     $\pi_{32}$       $1 - \pi_{1+} - \pi_{2+}$
Total         $\pi_{+1}$     $1 - \pi_{+1}$   $1$

Under $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$
- There are $(I - 1) + (J - 1)$ free parameters: the marginal probabilities.
- MLEs of the marginal probabilities are $\hat{\pi}_{i+} = p_{i+}$ and $\hat{\pi}_{+j} = p_{+j}$.
- Restricted MLEs are $\hat{\pi}_{ij} = p_{i+}\,p_{+j}$.
- The null hypothesis reduces the number of free parameters in the model by $(IJ - 1) - (I - 1 + J - 1) = (I - 1)(J - 1)$.
- So the test has $(I - 1)(J - 1)$ degrees of freedom.

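Estimated expected frequencies
Under the null hypothesis of independence
\begin{align*}
\hat{\mu}_{ij} &= n\,\hat{\pi}_{ij} \\
&= n\,\hat{\pi}_{i+}\,\hat{\pi}_{+j} \\
&= n\,p_{i+}\,p_{+j} \\
&= n\,\frac{n_{i+}}{n}\,\frac{n_{+j}}{n} \\
&= \frac{n_{i+}\,n_{+j}}{n} \\
&= \frac{(\text{Row total})\,(\text{Column total})}{\text{Total total}}
\end{align*}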
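As a sketch of this formula, the function below computes (row total)(column total)/(total total) for every cell and applies it to the course example from the earlier slide. The function name is an illustrative choice.

```python
def expected_frequencies(table):
    """Estimated expected counts under independence: n_i+ * n_+j / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Course example: rows Catch-up, Mainstream, Elite;
# columns (Did not pass, Passed).
observed = [
    [27, 8],
    [124, 204],
    [7, 24],
]

expected = expected_frequencies(observed)
for row in expected:
    print([round(mu, 2) for mu in row])
```

The expected frequencies preserve the observed row and column totals, which is a quick sanity check on the calculation.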

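Test statistics
For testing independence
\[ G^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} \log\left( \frac{n_{ij}}{\hat{\mu}_{ij}} \right)
\qquad
X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}} \]
With expected frequencies
\[ \hat{\mu}_{ij} = \frac{n_{i+}\,n_{+j}}{n} = \frac{(\text{Row total})\,(\text{Column total})}{\text{Total total}} \]
And degrees of freedom
\[ df = (I - 1)(J - 1) \]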
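A minimal sketch of both statistics, using only the Python standard library and the course example from the earlier slide. The function name is an assumption, and cells with a zero count are treated as contributing zero to $G^2$.

```python
import math

def independence_statistics(table):
    """Return (G^2, X^2, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g2 = 0.0
    x2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            mu_ij = row_totals[i] * col_totals[j] / n  # expected frequency
            if n_ij > 0:  # 0 * log(0) is taken to be 0
                g2 += 2 * n_ij * math.log(n_ij / mu_ij)
            x2 += (n_ij - mu_ij) ** 2 / mu_ij
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, x2, df

# Course example: rows Catch-up, Mainstream, Elite;
# columns (Did not pass, Passed).
observed = [
    [27, 8],
    [124, 204],
    [7, 24],
]

g2, x2, df = independence_statistics(observed)
print(round(g2, 2), round(x2, 2), df)
```

For this table both statistics are well above 5.99, the 0.05 critical value of the chi-squared distribution with 2 degrees of freedom, so the data are not consistent with independence of course and passing.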

Copyright Information
This slide show was prepared by Jerry Brunner, Department of Statistics, University of Toronto. It is licensed under a Creative Commons Attribution - ShareAlike 3.0 Unported License. Use any part of it as you like and share the result freely. The LaTeX source code is available from the course website: http://www.utstat.toronto.edu/~brunner/oldclass/312f12