Stat 504, Lecture 7
Review of One-way Tables and SAS


In-class exercises: Ex1, Ex2, and Ex3 from http://v8doc.sas.com/sashtml/proc/z0146708.htm

To calculate the p-value for an X² or G² statistic in SAS, use the PROBCHI function (see http://v8doc.sas.com/sashtml/lgref/z0245929.htm#z0845409). For example, if X² = 0.47 with df = 2, then the p-value is 1 - probchi(0.47, 2).
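If you would rather check the same calculation outside SAS, a minimal R sketch:

> 1 - pchisq(0.47, df = 2)   # upper-tail chi-square p-value, df = 2
[1] 0.7905712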

Introduction to Two-Way Tables

Example 1: A 2 × 2 table of counts and/or proportions.

Table 1: Incidence of common colds among French skiers (Pauling (1971), as reported in Fienberg (1980)).

                Cold   No Cold   Totals
Placebo           31       109      140
Ascorbic Acid     17       122      139
Totals            48       231      279

Table 2: Incidence of common colds among French skiers, as proportions (Pauling (1971), as reported in Fienberg (1980)).

                Cold   No Cold   Totals
Placebo        0.111     0.391    0.502
Ascorbic Acid  0.061     0.437    0.498
Totals         0.172     0.828    1.000

Q1: Compare the relative frequency of occurrence of some characteristic across two groups, e.g., is the probability that a member of the placebo group contracts a cold the same as the probability that a member of the ascorbic acid group contracts a cold?

Q2: Are two characteristics independent, e.g., are the type of treatment and contracting a cold associated?

Q3: Is one characteristic a cause of another, e.g., does ascorbic acid (vitamin C) have a therapeutic value that prevents colds?
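Table 2 can be reproduced from the Table 1 counts with prop.table(); a minimal R sketch (the object name skiers is ours, for illustration):

> skiers <- matrix(c(31, 17, 109, 122), 2, 2,
+                  dimnames = list(c("Placebo", "Ascorbic Acid"),
+                                  c("Cold", "No Cold")))
> round(prop.table(skiers), 3)   # divide each count by n = 279
               Cold No Cold
Placebo       0.111   0.391
Ascorbic Acid 0.061   0.437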

Suppose that we collect data on two binary variables, Y and Z. Binary means that each variable takes two possible values, say 1 (e.g., "cold") and 2 (e.g., "no cold"). Suppose we collect values of Y (e.g., treatment) and Z (e.g., contracting a cold) for n sample units. The data then consist of n pairs, (y_1, z_1), (y_2, z_2), ..., (y_n, z_n). We can summarize the data in a frequency table. Let x_ij be the number of sample units having Y = i and Z = j. Then x = (x_11, x_12, x_21, x_22) is a summary of all n responses, e.g., x_11 = 31. We could display x as a one-way table with four cells, but it is customary to display x as a square table with two rows and two columns:

        Z = 1   Z = 2
Y = 1    x_11    x_12
Y = 2    x_21    x_22

Marginal totals. When a subscript in a cell count x_ij is replaced by a plus sign (+), it means that we have summed the cell counts over that subscript. The row totals are

x_1+ = x_11 + x_12,
x_2+ = x_21 + x_22,

the column totals are

x_+1 = x_11 + x_21,
x_+2 = x_12 + x_22,

and the grand total is

x_++ = x_11 + x_12 + x_21 + x_22 = n.

These quantities are often called marginal totals, because they are conveniently placed in the margins of the table, like this:

        Z = 1   Z = 2   total
Y = 1    x_11    x_12    x_1+
Y = 2    x_21    x_22    x_2+
total    x_+1    x_+2    x_++

If the sample units are randomly sampled from a large population, then x = (x_11, x_12, x_21, x_22) has a multinomial distribution with index n = x_++ and parameter vector π = (π_11, π_12, π_21, π_22), where π_ij = P(Y = i, Z = j).

        Z = 1   Z = 2   total
Y = 1    π_11    π_12    π_1+
Y = 2    π_21    π_22    π_2+
total    π_+1    π_+2    π_++ = 1

The probability distribution {π_ij} is the joint distribution of Y and Z. When you sum the joint probabilities, you get a marginal distribution, e.g., the probability distribution {π_i+} is the marginal distribution of Y, where P(Y = 1) = π_1+ and P(Y = 2) = π_2+. How does the distribution of Z change as the category of Y changes? The conditional distribution of Z given Y = i is

π_{j|i} = π_ij / π_i+,

which satisfies Σ_j π_{j|i} = 1.
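As an aside, the observed conditional distributions for Table 1 can be read off with prop.table(); a minimal sketch, reusing the skiers matrix defined in the earlier sketch:

> round(prop.table(skiers, margin = 1), 3)   # each row divided by its row total
               Cold No Cold
Placebo       0.221   0.779
Ascorbic Acid 0.122   0.878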

In-class exercise: What is the observed conditional probability distribution P(cold | treatment)?

Under the general multinomial model, the π vector contains three unknown parameters (four probabilities that must sum to one). The general multinomial model is often called the saturated model, because it contains the maximum number of unknown parameters. Explore the geometry of 2 × 2 tables at: http://www-2.cs.cmu.edu/~eairoldi/tetrahedron3d/

The independence model

Given a 2 × 2 table, it is natural to ask how Y and Z are related. Suppose for the moment that there is no relationship between Y and Z, i.e., that they are independent. Independence means that

π_ij = P(Y = i, Z = j) = P(Y = i) P(Z = j) for i, j = 1, 2.

Let P(Y = 1) = α and P(Z = 1) = β, so that P(Y = 2) = 1 − α and P(Z = 2) = 1 − β. Under independence, we have

π_11 = P(Y = 1) P(Z = 1) = αβ,              (1)
π_12 = P(Y = 1) P(Z = 2) = α(1 − β),        (2)
π_21 = P(Y = 2) P(Z = 1) = (1 − α)β,        (3)
π_22 = P(Y = 2) P(Z = 2) = (1 − α)(1 − β).  (4)

Note that

    α = π_1+ = π_11 + π_12,
1 − α = π_2+ = π_21 + π_22,
    β = π_+1 = π_11 + π_21,
1 − β = π_+2 = π_12 + π_22,

so the condition of independence can be conveniently written as

π_ij = π_i+ π_+j,  i, j = 1, 2.  (5)

The primary reason that we introduced the symbols α and β for π_1+ and π_+1 is to emphasize that under the independence model there are only two unknown parameters. Once α and β are known, the vector π can be found using (1)–(4). The independence model is a submodel of (i.e., a special case of) the saturated model that satisfies the constraints (5).

Test of independence

The hypothesis of independence can be tested using the general method described in Lecture 4. To test H_0: the independence model is true versus H_1: the saturated model is true, do the following. First, estimate α and β, the unknown parameters of the independence model. Second, calculate estimated cell probabilities and expected frequencies from the estimated α and β. Third, calculate X² and/or G² and compare them to the appropriate chi-square distribution.

How can we estimate α and β? Under H_0, Y (e.g., treatment) and Z (e.g., cold) provide no information about one another, so we can estimate the parameters of their distributions separately. Note that

x_1+ ∼ Bin(n, α)  (6)

and

x_+1 ∼ Bin(n, β),  (7)

and under H_0, (6) and (7) are independent.

Therefore, the ML estimates of α and β are

ˆα = x_1+ / n  and  ˆβ = x_+1 / n.

Plugging these estimates into (1)–(4) gives the estimated probabilities

ˆπ_11 = (x_1+ / n)(x_+1 / n),   ˆπ_12 = (x_1+ / n)(x_+2 / n),
ˆπ_21 = (x_2+ / n)(x_+1 / n),   ˆπ_22 = (x_2+ / n)(x_+2 / n),

and the estimated expected cell counts

E_11 = nˆπ_11 = x_1+ x_+1 / n,   E_12 = nˆπ_12 = x_1+ x_+2 / n,
E_21 = nˆπ_21 = x_2+ x_+1 / n,   E_22 = nˆπ_22 = x_2+ x_+2 / n.

These four formulas are conveniently summarized as

E_ij = x_i+ x_+j / n,  i, j = 1, 2,

which can be easily remembered as

expected frequency = (row total × column total) / grand total.
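The summary formula is easy to compute in any package; a minimal R sketch (the helper name expected is ours, for illustration), applied to the French skier counts:

> expected <- function(x) outer(rowSums(x), colSums(x)) / sum(x)
> expected(matrix(c(31, 17, 109, 122), 2, 2))   # Table 1 counts
         [,1]      [,2]
[1,] 24.08602 115.91398
[2,] 23.91398 115.08602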

Under H_0, both X² and G² are approximately χ²-distributed provided that the expected counts E_ij are sufficiently large. Under H_0 the model has 2 unknown parameters, whereas under H_1 there are 3 unknowns. The degrees of freedom are therefore ν = 3 − 2 = 1. A large value of X² or G² indicates that the independence model is not plausible, and thus that Y and Z are related. The 95th percentile of χ²_1 is 3.84, so an observed value of X² or G² greater than 3.84 means that we can reject the null hypothesis of independence at the .05 level.
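You can verify the critical value in R:

> qchisq(0.95, df = 1)   # 0.05 critical value for 1 df
[1] 3.841459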

The test for independence in a 2 × 2 table is a special case of the general goodness-of-fit test discussed in Lectures 5 and 6. Therefore, all of the caveats regarding goodness-of-fit tests discussed there apply to this test as well. For the chi-square approximation to work well, the E_ij's need to be sufficiently large. The iid assumption for the n sample units must also be satisfied; there should be no clustering in the data.

Example. Suppose that in a sample of n = 300 hospital patients, 90 are overweight, 90 are hypertensive, and 30 are both overweight and hypertensive. Is there evidence of a relationship between these two conditions? The observed data are shown below.

                 hypertensive   not hypertensive   total
overweight                 30                 60      90
not overweight             60                150     210
total                      90                210     300

The expected cell counts for the four cells are

E_11 = (90 × 90) / 300 = 27,     E_12 = (90 × 210) / 300 = 63,
E_21 = (210 × 90) / 300 = 63,    E_22 = (210 × 210) / 300 = 147.

The goodness-of-fit statistic is

X² = (30 − 27)²/27 + (60 − 63)²/63 + (60 − 63)²/63 + (150 − 147)²/147 = 0.68,

and the deviance is

G² = 2 [ 30 log(30/27) + 60 log(60/63) + 60 log(60/63) + 150 log(150/147) ] = 0.67.

Neither statistic exceeds 3.84, so we cannot reject the independence model at the .05 level. An approximate p-value is P(χ²_1 ≥ 0.68) = .40. On the basis of these data, there is little evidence of a relationship between the two conditions.

The test for independence in a 2 × 2 table can be done in Minitab using the chisq command:

MTB > read c1-c2
DATA> 30 60
DATA> 60 150
DATA> end
      2 rows read.
MTB > chisq c1-c2

Expected counts are printed below observed counts

            C1       C2    Total
    1       30       60       90
         27.00    63.00

    2       60      150      210
         63.00   147.00

Total       90      210      300

ChiSq = 0.333 + 0.143 + 0.143 + 0.061 = 0.680
df = 1

Note that Minitab gives only Pearson's X². Calculating the deviance G² in Minitab is a little more tedious. One way to do it is to enter the cell counts in a single column, say, C1. Then enter the row sums and column sums in C2 and C3, respectively. Then calculate the expected cell counts and put them into C4.

MTB > set c1                          # enter observed counts
DATA> 30 60 60 150
DATA> end
MTB > set c2                          # enter row sums
DATA> 90 90 210 210
DATA> end
MTB > set c3                          # enter column sums
DATA> 90 210 90 210
DATA> end
MTB > let c4 = c2*c3/300              # calculate expected counts
MTB > let k1 = 2*sum(c1*log(c1/c4))   # calculate G^2
MTB > print k1
K1    0.672805

In R or S-PLUS the Pearson X² test is easily carried out using the chisq.test() function. By default, this function applies the continuity correction proposed by Yates (1934) for a 2 × 2 table. This correction is not universally regarded as appropriate, however, so we will not use it. To turn off the Yates correction, include correct = FALSE as an argument to chisq.test().

> x <- c(30, 60, 60, 150)        # enter data
> x <- matrix(x, 2, 2)           # convert to a matrix
> chisq.test(x, correct = FALSE)

        Pearson's chi-square test without Yates' continuity correction

data:  x
X-squared = 0.6803, df = 1, p-value = 0.4095

To calculate G² in R or S-PLUS, you need to go through essentially the same steps as in Minitab.

> ob <- c(30, 60, 60, 150)       # observed counts
> rsum <- c(90, 90, 210, 210)    # row sums
> csum <- c(90, 210, 90, 210)    # column sums
> ex <- rsum * csum / 300        # expected counts
> G2 <- 2 * sum(ob * log(ob / ex))
> G2
[1] 0.6728037

In SAS, requesting the CHISQ option under PROC FREQ will, for two-way tables and above, give you both the Pearson X² statistic and the deviance G² (labeled "Likelihood Ratio Chi-Square"). See: http://v8doc.sas.com/sashtml/proc/zreq-ex3.htm

Multinomial sampling: In one type of experiment, we draw a sample of n = x_++ subjects from a population and record (Y, Z) for each subject. Then the joint distribution of {x_ij} is multinomial with index n and parameter π = {π_ij}, where π_ij = P(Y = i, Z = j) and the grand total n is fixed and known. Sometimes we express the parameters as the cell means m_ij = E(x_ij) = nπ_ij.
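To make the scheme concrete, a minimal R sketch that draws one 2 × 2 table under multinomial sampling with n = 300 fixed; the cell probabilities here are hypothetical, chosen only for illustration:

> p <- c(0.09, 0.21, 0.21, 0.49)                      # hypothetical pi_ij, summing to 1
> matrix(rmultinom(1, size = 300, prob = p), 2, 2)    # one simulated table; counts vary by draw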

Poisson sampling: x_ij ∼ Poisson(m_ij) independently for i = 1, ..., I and j = 1, ..., J. In this scheme, the overall n is not fixed. Example: You sit by the roadside for one hour with a radar gun, checking the speed of each car as it passes by. You record Y = color of the car (1 = black, 2 = white, 3 = red, 4 = other) and Z = whether the car's speed exceeds the legal limit (1 = yes, 2 = no).

In Lecture 4, we argued that the likelihood function may be factored into the product of a Poisson likelihood for n,

n ∼ Poisson(m_++),

and a multinomial likelihood for {x_ij} given n, with parameters

π_ij = m_ij / m_++.

The total n provides no information about π = {π_ij}. From a likelihood standpoint, we get the same inferences about π whether n is regarded as fixed or random. Therefore, if m_++ is not of interest, Poisson data may be analyzed as if they were multinomial. Conversely, if data are multinomial, we may analyze them as if they were Poisson. The inferences for π are valid, and the inferences for m_++ should be ignored.
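A quick way to see this equivalence in practice (a sketch, not part of the original notes): fit the Poisson log-linear model of independence to the overweight example; its residual deviance reproduces the multinomial G² computed earlier.

> counts <- c(30, 60, 60, 150)     # overweight example, by row
> r <- factor(c(1, 1, 2, 2))       # row labels
> s <- factor(c(1, 2, 1, 2))       # column labels
> fit <- glm(counts ~ r + s, family = poisson)
> deviance(fit)                    # residual deviance = G^2
[1] 0.6728037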

Product-multinomial sampling: Decide beforehand that we will draw x_i+ subjects with characteristic Y = i (i = 1, ..., I) and record the Z-value for each one. In this scenario, each row of the table, (x_i1, x_i2, ..., x_iJ)^T, is multinomial with probabilities π_{j|i} = π_ij / π_i+, and the rows are independent. Viewing the data as product-multinomial is appropriate when the row totals truly are fixed by design, as in

- stratified random sampling (strata defined by Y), or
- an experiment where Y = treatment group.

It is also appropriate when the row totals are not fixed, but we are interested in P(Z | Y) and not P(Y). That is, when Z is the outcome of interest, and Y is an explanatory variable that we do not wish to model.
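For instance, treating the rows of the overweight example as two independent binomials and testing equality of P(Z = 1 | Y = i) across rows gives the same Pearson X² as the test of independence on the full table; a minimal sketch in R:

> prop.test(x = c(30, 60), n = c(90, 210), correct = FALSE)

        2-sample test for equality of proportions without continuity correction

data:  c(30, 60) out of c(90, 210)
X-squared = 0.6803, df = 1, p-value = 0.4095
...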

Suppose the data are multinomial. Then by results from Lecture 4, we may factor the likelihood into two parts:

- a multinomial likelihood for the row totals (x_1+, x_2+, ..., x_I+)^T with index n and parameter {π_i+}, and
- I independent multinomial likelihoods for the rows, (x_i1, x_i2, ..., x_iJ)^T, with parameters {π_{j|i} = π_ij / π_i+}.

Therefore, if the parameters of interest to us can be expressed as functions only of the π_{j|i}'s and not the π_i+'s, then correct likelihood-based inferences may be obtained by treating the data as if they were product-multinomial. Conversely, if the data are product-multinomial, then correct likelihood-based inferences about functions of the π_{j|i}'s will be obtained if we analyze the data as if they were multinomial. We may also treat them as Poisson, ignoring any inferences about m_++ or m_i+.

Hypergeometric sampling: In a few rare examples, we may encounter data where both the row totals (x_1+, ..., x_I+)^T and the column totals (x_+1, ..., x_+J)^T are fixed by design. The best-known example of this is Fisher's hypothetical example of the lady tasting tea, which we will discuss soon. In a 2 × 2 table, the resulting sampling distribution is hypergeometric. Even when both sets of marginal totals are not fixed by design, some statisticians like to condition on them and perform exact inference when the sample size is small and asymptotic approximations are unlikely to work well. Methods for exact inference will be discussed later.
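As a preview (a sketch only; exact methods come later), R's fisher.test() carries out the exact test that conditions on both margins, shown here on the overweight example:

> x <- matrix(c(30, 60, 60, 150), 2, 2)
> fisher.test(x)   # exact p-value and conditional odds-ratio estimate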

Next lecture:

Suggested reading: Ch. 2 and Ch. 3 of Agresti. Next week we'll cover the test of independence, measures of association, and exact tests for 2 × 2 and I × J tables.

There is no regular homework assignment due next week. However, there is an EXTRA credit assignment due on Tuesday, Feb. 8, 2005.

1. For the French skier example, are the two variables independent; i.e., are the treatment and response independent?

2. What seems to be the most reasonable sampling scheme for this problem? That is, if you were to design the study, which sampling model discussed today would you apply, and why?

3. Read the on-line information (example) on analysis of 2 × 2 tables in SAS (see the SAS link given earlier). Run the analysis of the overweight example in SAS. Submit your code and compare your results to what we got in class today. What is the most appropriate sampling model for this example, and why?