Contingency Tables. Contingency tables are used when we want to look at two (or more) factors. Each factor might have two or more levels.


Contingency Tables - Definition & Examples

Contingency tables are used when we want to look at two (or more) factors. Each factor might have two or more levels. (Using more than two factors gets complicated, so we won't be looking at that in this class.)

A factor can be defined as a categorical variable; examples might be sex, color, age, etc. (We already looked at tests where we only have one factor - goodness of fit tests.)

A level can be defined as one of the different values for a factor; examples matching the factors above might be {male, female}, {red, blue, green}, and {0-4, 5-9, 10-14, etc.}.

For each combination of levels, we have a certain count, which tells us how many individuals (or whatever we're measuring) are in that category. An example (data from 1988, Florida):

                               Injury
  Safety equipment in use     Fatal    Non-fatal      Total
  None                        1,601      165,527    167,128
  Seat belt                     510      412,368    412,878
  Total                       2,111      577,895    580,006

We have two factors, injury and safety equipment used, each with two levels.

So now that we have an example, what are we interested in? It depends. In the example above, what would be interesting to know? Do seat belts save lives? Specifically, is p1 = p2, where:

  p1 = proportion of fatalities for people not wearing a seat belt (estimated by p̂1)
  p2 = proportion of fatalities for people wearing a seat belt (estimated by p̂2)

Incidentally, we're obviously interested in a one-sided alternative here (why?).
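As a quick sanity check on the arithmetic, the two estimated proportions can be computed directly from the table above. This is just a minimal Python sketch; the variable names are mine, for illustration only:

```python
# Counts from the 1988 Florida seat-belt table above.
fatal_none, nonfatal_none = 1_601, 165_527   # no safety equipment
fatal_belt, nonfatal_belt = 510, 412_368     # seat belt in use

# Estimated fatality proportions: fatal count / row total.
p1_hat = fatal_none / (fatal_none + nonfatal_none)  # no seat belt
p2_hat = fatal_belt / (fatal_belt + nonfatal_belt)  # seat belt

print(round(p1_hat, 5), round(p2_hat, 5))  # 0.00958 0.00124
```

These are the p̂1 and p̂2 values used later in the one-sided test.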

But let's do one from our old textbook. This one's kind of interesting, as it was based on a study done in 1899 and then re-analyzed in 1954 by Goodman and Kruskal. Kruskal is famous for being one of the people who developed the Kruskal-Wallis test, which we discuss in slightly more advanced classes:

                     Hair color
                 dark    light    Total
  Eye    dark     729      131      860
  color  light   3129     2814     5943
  Total          3858     2945     6803

Now what do we want to know? Does it make sense to figure out if the proportion of people with dark eyes is the same whether they have dark hair or light hair? Well, maybe. It might be much more interesting to ask the question like this: Does eye color influence hair color, or vice versa? Or in statistical language: Is eye color independent of hair color?

So, depending on the kind of data, we're either interested in:

  1) Comparing proportions
  2) Establishing independence/dependence

So here's an outline of how to do our test. We phrase our hypotheses accordingly:

  1) H0: p1 = p2, where the p's are the true population proportions
     H1: p1 ≠ p2 (or p1 > p2, etc.)

  2) H0: Factor 1 and factor 2 are independent
     H1: Factor 1 and factor 2 are dependent

We'll notice that the math is the same regardless of which set of hypotheses we use.

Now let's choose α, just as always.

Calculate our test statistic:

  χ²* = Σ (Oi - Ei)² / Ei, summing over i = 1 to c

Note that this is identical to the statistic used for the goodness of fit test, but now c is the number of cells in our table (four in both of our examples so far). We'll figure out how to get the expected values below.

Look up the tabulated χ² value. Our degrees of freedom are not c - 1 anymore. Instead, they are (r-1) x (k-1), where:

  r = # of rows
  k = # of columns

So for both our examples so far we have: (2-1) x (2-1) = 1 x 1 = 1.

Compare our χ²* with χ²table, and if it's larger (or equal to it), then reject H0. State our conclusion in terms of our original hypothesis.

So what about our expected values? Let's work with the proportion of people with dark eyes and suppose that hair color has no effect on eye color. This implies that the proportion of people with dark eyes is the same in both columns. Therefore, we figure out the overall proportion of people with dark eyes using the row totals: add up all people with dark eyes, and divide this by the total number of people in our sample.

Now we note that if it doesn't make any difference whether you have light or dark hair, then the proportion of people with dark eyes should be the same in both columns. In other words, we multiply each column total by the overall proportion of people with dark eyes. This gives us the expected number of people with dark eyes for each column. If we think about this a bit, it gives us the following (easy to remember) formula:

  Expected value = (Row total) x (Column total) / (Grand total)
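The row-total-times-column-total recipe is easy to turn into code. Here's a minimal Python sketch (the function name `expected_counts` is mine, not anything standard):

```python
def expected_counts(table):
    """Expected cell counts under H0:
    E = (row total) * (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals]
            for r in row_totals]

# Seat-belt table: rows = (none, seat belt), columns = (fatal, non-fatal).
observed = [[1_601, 165_527],
            [510, 412_368]]
for row in expected_counts(observed):
    print([round(e, 2) for e in row])
# [608.28, 166519.72]
# [1502.72, 411375.28]
```

Note that the expected counts in each row and column sum to the same totals as the observed counts, which is a handy check on your arithmetic.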

So now let's do a few examples.

Seat belts and fatalities. State hypotheses:

  H0: The proportion of people killed is the same whether or not they are wearing a seat belt
      (or: H0: p1 = p2, as defined above)
  H1: The proportions are not the same (we'll stick with a two-sided test for the moment)

Let's use α = .05.

We calculate our expected values from our observed values:

                               Injury
  Safety equipment in use     Fatal    Non-fatal      Total
  None                        1,601      165,527    167,128
  Seat belt                     510      412,368    412,878
  Total                       2,111      577,895    580,006

So, our first expected value (fatal, none):

  (2,111 x 167,128) / 580,006 = 608.28

Our second expected value (non-fatal, none):

  (577,895 x 167,128) / 580,006 = 166,519.72

Our third expected value (fatal, seat belt):

  (2,111 x 412,878) / 580,006 = 1,502.72

And finally (non-fatal, seat belt):

  (577,895 x 412,878) / 580,006 = 411,375.28

We can put this in a table if we want (to keep it straight). Expected values:

                               Injury
  Safety equipment in use      Fatal    Non-fatal      Total
  None                        608.28   166,519.72    167,128
  Seat belt                 1,502.72   411,375.28    412,878
  Total                        2,111      577,895    580,006

Now let's calculate our χ²*:

  χ²* = (1,601 - 608.28)²/608.28 + (165,527 - 166,519.72)²/166,519.72
      + (510 - 1,502.72)²/1,502.72 + (412,368 - 411,375.28)²/411,375.28
      = 2,284.25

If we look up our critical value of χ² in the table (1 d.f. and α = .05), we get:

  χ².05,1 = 3.84

So we reject H0 and conclude that seat belts do affect the outcome of a traffic accident. Incidentally, p < 1 x 10^-497, though I sort of doubt the accuracy of that figure.

However, we suspect seat belts save lives, so what we really wanted was a one-sided test:

  - Proceed as above, though now your alternative hypothesis is H1: p1 > p2 (p1 is the proportion of fatalities for folks not wearing seat belts).
  - Make sure the data deviate from the null hypothesis in the direction of your alternative; otherwise STOP. In other words, make sure p̂1 > p̂2 (notice that p̂1 = 0.00958 and p̂2 = 0.00124, so we're okay).
  - Now just use the appropriate column in your χ² table (the one-sided p-value is half the χ² tail probability, so use the 2α = 0.10 column):

      χ²table = χ².10,1 = 2.71

And again we get to reject.
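The whole calculation above can be sketched in a few lines of Python. This version keeps full precision in the expected values, so the result differs from the hand calculation only in the last decimal places:

```python
# Seat-belt data: rows = (none, seat belt), columns = (fatal, non-fatal).
observed = [[1_601, 165_527],
            [510, 412_368]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

# Expected counts under H0, then chi-square = sum over cells of (O - E)^2 / E.
chi2_star = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_tot[i] * col_tot[j] / grand
        chi2_star += (obs - exp) ** 2 / exp

print(round(chi2_star, 1))  # ~2284.2, far beyond the critical value 3.84
```

With 1 degree of freedom, a statistic this large corresponds to an astronomically small p-value, which is why the table lookup is almost a formality here.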

(Notice that the one-sided χ² table value is lower, which makes rejection easier and gives us more power - not really needed in this example, but still true!)

Now let's do our second example:

                     Hair color
                 dark    light    Total
  Eye    dark     729      131      860
  color  light   3129     2814     5943
  Total          3858     2945     6803

  H0: Eye color and hair color are independent
  H1: They are not independent

  α = .05

Calculate expected values (I'm skipping the details; they're in your text, and I went through them above):

                     Hair color
                 dark       light
  Eye    dark     487.71    372.29
  color  light   3370.29   2572.71

Calculate χ²* (the same as usual; I'm skipping the details):

  χ²* = 315.671

Our tabulated value is χ².05,1 = 3.84, so we reject our H0 and conclude H1.

Do we want to do anything else? Yes: we can see in which direction our sample deviates from H0. In other words, since eye color and hair color are not independent, we can look and see whether dark eyes occur more commonly with dark hair or with light hair:

  If a person has dark hair, the proportion with dark eyes is 729/3858 = 0.1890.
  If a person has light hair, the proportion with dark eyes is 131/2945 = 0.0445.

From that information, we can conclude that dark hair goes with dark eyes, and therefore light hair with light eyes. (No surprise: blond, blue-eyed, etc., etc.)
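The direction check is just a pair of column-wise conditional proportions, which can be sketched like so (variable names are mine):

```python
# Hair/eye table: rows = (dark eyes, light eyes), columns = (dark hair, light hair).
observed = [[729, 131],
            [3_129, 2_814]]

col_tot = [sum(c) for c in zip(*observed)]            # column totals: [3858, 2945]
p_dark_eyes_dark_hair = observed[0][0] / col_tot[0]   # 729 / 3858
p_dark_eyes_light_hair = observed[0][1] / col_tot[1]  # 131 / 2945

print(round(p_dark_eyes_dark_hair, 4))   # 0.189
print(round(p_dark_eyes_light_hair, 4))  # 0.0445
```

Dividing by row totals instead would give the other family of conditional proportions (proportion with dark hair given eye color), as discussed next.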

Note that to calculate the proportions above we used the column totals in the denominators of both calculations. You can also use the row totals - as long as you're consistent. You'd be calculating different proportions, but you should still be able to interpret them: e.g., if you have dark eyes, the proportion of people with dark hair is ____; if you have light eyes, the proportion of people with dark hair is ____; and so on.

R x K tables. We refer to any table bigger than 2 x 2 as an R x K table. This is pretty easy to deal with: with the possible exception of figuring out your hypotheses, you already know how to do it. Let's look at three different species of squirrel and compare food preferences:

              Peanuts    Walnuts    TOTAL
  Species A        21         32       53
  Species B        10         15       25
  Species C        15         12       27
  TOTAL            46         59      105

An obvious thing to ask is whether there's a difference in food preference between the three species.

  H0: The proportions of peanuts and walnuts are the same for all three species of squirrel.
  H1: The proportions are not the same.

(Note that we can't do a directional or one-sided alternative once we're bigger than a 2 x 2 table.)

  α = .05

  χ²* = 2.0381 (calculated the same way as always!)

  df = ν = (r-1)(k-1) = 2 x 1 = 2

From our table, the critical value of chi-square with 2 d.f. and an α of .05 is 5.991. Since 2.0381 < 5.991, we fail to reject H0. Our conclusion is that we have no evidence that the proportions of peanuts and walnuts differ among the three species of squirrel.
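The R x K case uses exactly the same machinery as the 2 x 2 case; here's a self-contained sketch for the squirrel table:

```python
# Squirrel food-preference table:
# rows = species A, B, C; columns = (peanuts, walnuts).
observed = [[21, 32],
            [10, 15],
            [15, 12]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

# chi-square = sum over all six cells of (O - E)^2 / E,
# with E = row total * column total / grand total.
chi2_star = sum(
    (observed[i][j] - row_tot[i] * col_tot[j] / grand) ** 2
    / (row_tot[i] * col_tot[j] / grand)
    for i in range(3) for j in range(2)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2_star, 4), df)  # 2.0381 2 -> below 5.991, so H0 is not rejected
```

The only thing that changes from the 2 x 2 case is the number of cells in the sum and the degrees of freedom.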

IV) Comments: One can calculate things like odds ratios and relative risk for tables like this. These are actually very important in medical trials. We don't have the time to go into the details, but notice that the relative risk isn't that difficult to calculate. For example, let's look at the risk of dying in a car accident when not wearing a seat belt as opposed to wearing one:

  R̂R = p̂1 / p̂2 = 0.00958 / 0.00124 = 7.73

where R̂R is the estimated relative risk. This tells us that the risk of dying in a car crash is 7.73 times higher if you're not wearing a seat belt than if you are!!

The odds ratio is a similar measure, but requires rather more explanation. As mentioned, both are used extensively in medical trials (e.g., the risk of getting lung cancer if you smoke is about 40 times that of a non-smoker - figures approximately correct!).

Finally, a quick word about the assumptions. They are identical to those for a goodness of fit test:

  - Random data
  - Smallest expected value ≥ 5
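The relative-risk arithmetic is a one-liner once you have the proportions. A small sketch (using the unrounded proportions, which gives about 7.76 rather than the 7.73 obtained above from the rounded values):

```python
# Estimated fatality proportions from the seat-belt table.
p1_hat = 1_601 / 167_128   # no seat belt
p2_hat = 510 / 412_878     # seat belt

rr_hat = p1_hat / p2_hat   # estimated relative risk
print(round(rr_hat, 2))    # ~7.76 at full precision (7.73 from rounded proportions)
```

The small discrepancy is purely a rounding artifact; in practice you'd carry full precision and round only the final answer.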