Categorical Data Analysis


References:
Alan Agresti, Categorical Data Analysis, Wiley-Interscience, New Jersey, 2002.
Bhattacharyya, G.K., Johnson, R.A., Statistical Concepts and Methods, Wiley, 1977.

Outline
- Categorical Response Data
- Distributions for Categorical Data
- Pearson's Test for Goodness of Fit
- Contingency Tables
- Test of Homogeneity and Exact Test

Categorical Response Data

A categorical variable has a measurement scale consisting of a set of categories. For instance:
- political philosophy is often measured as liberal, moderate, or conservative;
- religious affiliation is measured with the categories Protestant, Catholic, Muslim, Hindu, Buddhist, etc.

Nominal-Ordinal Scale Distinction

Categorical variables have two primary types of scales.

Nominal: variables having categories without a natural ordering. Examples:
- Mode of transportation to work: automobile, bicycle, bus, walk
- Favorite type of music: jazz, classical, rock, pop, dangdut, keroncong

Ordinal: many categorical variables do have ordered categories. Examples:
- Size of automobile: subcompact, compact, midsize, large
- Social class: upper, middle, lower
- Political philosophy: liberal, moderate, conservative

An interval variable is one that does have numerical distances between any two values. For example, blood pressure level, functional life length of a TV set, length of prison term, and annual income are interval variables.

The way that a variable is measured determines its classification. For example, education is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor's, master's, and doctorate. It is interval when measured by number of years of education, using the integers 0, 1, 2, ...

A variable's measurement scale determines which statistical methods are appropriate. The measurement hierarchy, from high to low, is: interval, ordinal, nominal. Methods for ordinal variables cannot be used with nominal variables, since their categories have no meaningful ordering. It is usually best to apply methods appropriate for the actual scale.

Data Type
- Quantitative (Numerical): discrete or continuous
- Qualitative (Categorical): discrete

Quantitative vs. Qualitative

Quantitative data: Variables recorded in numbers that we use as numbers are called quantitative. Quantitative variables have measurement units. Examples: incomes, heights, weights, ages, and counts.

Qualitative data: The numbers here are just labels and their values are arbitrary. They represent categories of the variables. We call such variables categorical. Examples: sex, area code, production group in a certain location.

Discrete vs. Continuous

Discrete data: The data are integers and usually come from a counting process. Examples: number of employees, number of rejected lots.

Continuous data: The data are usually on an interval scale. They are measurement data. Examples: temperature, heights, weights.

Discrete Data

Nominal: The rank of the data is not important. Example, production group:
  1 = Group A
  2 = Group B
  3 = Group C

Ordinal: The rank of the data is meaningful. Example, frequency of smoking:
  1 = very often
  2 = often
  3 = rarely
  4 = never

Distributions for Categorical Data

Binomial Distribution

Let $y_1, y_2, \ldots, y_n$ denote responses for $n$ independent and identical trials such that $P(Y_i = 1) = \pi$ and $P(Y_i = 0) = 1 - \pi$. Identical trials means that the probability of success, $\pi$, is the same for each trial. Independent trials means that the $\{Y_i\}$ are independent random variables. These are often called Bernoulli trials. The total number of successes, $Y = \sum_{i=1}^{n} Y_i$, has the binomial distribution with index $n$ and parameter $\pi$, denoted by bin($n$, $\pi$).

The probability mass function for the possible outcome $y$ for $Y$ is

$$p(y) = \binom{n}{y}\,\pi^{y}(1-\pi)^{n-y}, \qquad y = 0, 1, 2, \ldots, n.$$

The binomial distribution for $Y = \sum_i Y_i$ has mean and variance

$$\mu = E(Y) = n\pi, \qquad \sigma^2 = \operatorname{var}(Y) = n\pi(1-\pi).$$

There is no guarantee that successive binary observations are independent or identical.
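The binomial mass function, mean, and variance can be checked numerically. The following is a minimal sketch in standard-library Python; the values n = 10 and pi = 0.3 are illustrative assumptions, not taken from the slides.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

# Illustrative (assumed) parameters.
n, pi = 10, 0.3

probs = [binomial_pmf(y, n, pi) for y in range(n + 1)]
total = sum(probs)                                          # should be 1
mean = sum(y * p for y, p in enumerate(probs))              # should equal n * pi
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs)) # should equal n * pi * (1 - pi)

print(total, mean, var)
```

Summing over the full support recovers the closed-form moments up to floating-point error.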

Multinomial Distribution

Some trials have more than two possible outcomes. Suppose that each of $n$ independent, identical trials can have an outcome in any of $c$ categories. Let

$$y_{ij} = \begin{cases} 1 & \text{if trial } i \text{ has outcome in category } j, \\ 0 & \text{otherwise.} \end{cases}$$

Then $y_i = (y_{i1}, y_{i2}, \ldots, y_{ic})$ represents a multinomial trial, with $\sum_j y_{ij} = 1$.

Let $n_j = \sum_i Y_{ij}$ denote the number of trials having an outcome in category $j$. The counts $(n_1, n_2, \ldots, n_c)$ have the multinomial distribution. Let $\pi_j = P(Y_{ij} = 1)$ denote the probability of an outcome in category $j$ for each trial. The multinomial probability mass function is

$$p(n_1, n_2, \ldots, n_{c-1}) = \frac{n!}{n_1!\, n_2! \cdots n_c!}\; \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c},$$

with

$$E(n_j) = n\pi_j, \qquad \operatorname{var}(n_j) = n\pi_j(1-\pi_j).$$
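As a rough check of the multinomial formulas, the sketch below enumerates every count vector for a small case and compares the marginal mean and variance of the first count with the closed forms. The choices n = 4 and probabilities (0.5, 0.3, 0.2) are assumptions made only for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability of the count vector (n_1, ..., n_c)."""
    coef = factorial(sum(counts))
    for k in counts:
        coef //= factorial(k)  # each partial quotient is an integer
    return coef * prod(p**k for p, k in zip(probs, counts))

# Illustrative (assumed) setup: n = 4 trials, c = 3 categories.
n = 4
probs = (0.5, 0.3, 0.2)

# Enumerate all (n_1, n_2, n_3) with n_1 + n_2 + n_3 = n.
support = [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]
pmf = {cnt: multinomial_pmf(cnt, probs) for cnt in support}

total = sum(pmf.values())
mean_n1 = sum(cnt[0] * p for cnt, p in pmf.items())
var_n1 = sum((cnt[0] - mean_n1) ** 2 * p for cnt, p in pmf.items())

print(total, mean_n1, var_n1)  # compare with 1, n*pi_1, n*pi_1*(1 - pi_1)
```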

Poisson Distribution

Sometimes, count data do not result from a fixed number of trials. There is no upper limit for $y$. Since $y$ must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson. The Poisson mass function is

$$p(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots, \qquad E(Y) = \operatorname{var}(Y) = \mu.$$

The distribution approaches normality as $\mu$ increases.
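The equality of the Poisson mean and variance can be illustrated numerically. This sketch assumes mu = 2.5 and truncates the infinite support at y = 59, where the remaining tail mass is negligible for this mu; both choices are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for Y ~ Poisson(mu)."""
    return exp(-mu) * mu**y / factorial(y)

mu = 2.5          # assumed value for illustration
ys = range(60)    # truncated support; tail beyond 59 is negligible here

probs = [poisson_pmf(y, mu) for y in ys]
mean = sum(y * p for y, p in zip(ys, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, probs))

print(mean, var)  # both should be close to mu
```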

Pearson's Test for Goodness of Fit

Null hypothesis: $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$.

The Pearson $X^2$ test statistic:

$$X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}} = \sum_{\text{cells}} \frac{(O - E)^2}{E}.$$

Distribution: $X^2$ is asymptotically chi-squared with df $= k - 1$.

Rejection region: $X^2 \ge \chi^2_{\alpha}$, where $\chi^2_{\alpha}$ is the upper $\alpha$ point of the $\chi^2$ distribution with df $= k - 1$.
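A small worked example may help make the goodness-of-fit statistic concrete. The counts and null probabilities below are hypothetical. The example uses k = 3 cells so that df = 2, because for df = 2 (and only then) the chi-squared upper-tail probability has the closed form exp(-x/2), which avoids needing a statistics library.

```python
from math import exp, log

def pearson_x2(observed, p0):
    """Pearson X^2 = sum over cells of (O - E)^2 / E, with E = n * p_i0."""
    n = sum(observed)
    expected = [n * p for p in p0]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: n = 100 observations in k = 3 cells.
observed = [30, 50, 20]
p0 = [0.25, 0.50, 0.25]

x2 = pearson_x2(observed, p0)
df = len(observed) - 1

# Valid for df = 2 only: P(chi2_2 >= x) = exp(-x / 2).
p_value = exp(-x2 / 2)
critical = -2 * log(0.05)   # upper 0.05 point of chi-squared with df = 2
reject = x2 >= critical

print(x2, df, p_value, critical, reject)
```

With these made-up counts the statistic falls below the critical value, so the null hypothesis would not be rejected at the 0.05 level.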

Contingency Table

Observed counts:

              B_1    B_2    ...   B_c    Row total
A_1           n_11   n_12   ...   n_1c   n_10
A_2           n_21   n_22   ...   n_2c   n_20
...
A_r           n_r1   n_r2   ...   n_rc   n_r0
Column total  n_01   n_02   ...   n_0c   n

Cell and marginal probabilities:

- $p_{ij} = P(A_i B_j)$: probability of the joint occurrence of $A_i$ and $B_j$
- $p_{0j} = P(B_j)$: total probability in the $j$th column
- $p_{i0} = P(A_i)$: total probability in the $i$th row

              B_1    B_2    ...   B_c    Row total
A_1           p_11   p_12   ...   p_1c   p_10
A_2           p_21   p_22   ...   p_2c   p_20
...
A_r           p_r1   p_r2   ...   p_rc   p_r0
Column total  p_01   p_02   ...   p_0c   1

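The null hypothesis of independence, for all cells $(i, j)$:

$$H_0: p_{ij} = p_{i0}\, p_{0j}.$$

Estimation:

$$\hat{p}_{i0} = \frac{n_{i0}}{n}, \qquad \hat{p}_{0j} = \frac{n_{0j}}{n}, \qquad \hat{p}_{ij} = \hat{p}_{i0}\,\hat{p}_{0j} = \frac{n_{i0}\, n_{0j}}{n^2}.$$

Expectation:

$$E_{ij} = n\,\hat{p}_{ij} = \frac{n_{i0}\, n_{0j}}{n}.$$

The test statistic then becomes

$$X^2 = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$

which has an approximate $\chi^2$ distribution with d.f. $= (r-1)(c-1)$.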
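The expected counts and the test statistic for independence can be computed directly from the row and column totals. The sketch below uses a hypothetical 2x3 table (not from the slides); the function names are illustrative. Since (r-1)(c-1) = 2 here, the p-value again uses the df = 2 closed form exp(-x/2).

```python
from math import exp

def expected_counts(table):
    """E_ij = (row total i) * (column total j) / n under independence."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return [[r * c / n for c in col_tot] for r in row_tot]

def chi2_independence(table):
    """Return (X^2, df) for an r x c table of observed counts."""
    exp_tab = expected_counts(table)
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp_tab)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# Hypothetical 2 x 3 table of counts.
table = [[20, 30, 50],
         [30, 30, 40]]

E = expected_counts(table)
x2, df = chi2_independence(table)
p_value = exp(-x2 / 2)   # closed form valid because df = 2 here

print(E, x2, df, p_value)
```

The same computation applies to the test of homogeneity, since the expected counts take the same form there.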

Test of Homogeneity

The $\chi^2$ test of independence is based on the sampling scheme in which a single random sample of size $n$ is classified with respect to two characteristics simultaneously. An alternative sampling scheme involves a division of the population into subpopulations, or strata, according to the categories of one characteristic. A random sample of a predetermined size is drawn from each stratum and classified into categories of the other characteristic.

Contingency Table

Observed counts:

              B_1    B_2    ...   B_c    Row total
A_1           n_11   n_12   ...   n_1c   n_10
A_2           n_21   n_22   ...   n_2c   n_20
...
A_r           n_r1   n_r2   ...   n_rc   n_r0
Column total  n_01   n_02   ...   n_0c

$w_{ij} = P(B_j \mid A_i)$: probability of $B_j$ within the population $A_i$.

              B_1    B_2    ...   B_c    Row total
A_1           w_11   w_12   ...   w_1c   1
A_2           w_21   w_22   ...   w_2c   1
...
A_r           w_r1   w_r2   ...   w_rc   1

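Test of Homogeneity

The null hypothesis of homogeneity:

$$H_0: w_{1j} = w_{2j} = \cdots = w_{rj} \quad \text{for every } j = 1, \ldots, c.$$

Estimation:

$$\hat{w}_{1j} = \hat{w}_{2j} = \cdots = \hat{w}_{rj} = \frac{n_{0j}}{n}.$$

Expectation:

$$E_{ij} = (\text{no. of } A_i \text{ sampled}) \times (\text{estimated prob. of } B_j \text{ within } A_i) = n_{i0}\,\hat{w}_{ij} = \frac{n_{i0}\, n_{0j}}{n}.$$

The test statistic then becomes

$$X^2 = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$

which has an approximate $\chi^2$ distribution with d.f. $= (r-1)(c-1)$.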

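Measures of Association in a Contingency Table

Cramér's contingency coefficient:

$$Q_1 = \frac{\chi^2}{n(q-1)}, \qquad 0 \le Q_1 \le 1.$$

Pearson's coefficient of mean square contingency:

$$Q_2 = \sqrt{\frac{\chi^2}{n + \chi^2}}, \qquad 0 \le Q_2 \le \sqrt{\frac{q-1}{q}}.$$

Here $q = \min(r, c)$.

Pearson's phi coefficient in a 2x2 table:

$$\phi = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{10}\, n_{20}\, n_{01}\, n_{02}}}, \qquad -1 \le \phi \le 1.$$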
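To see how these coefficients behave, the sketch below evaluates them on the 2x2 tea-tasting table that appears later in these slides (3, 1 / 1, 3). It assumes the forms Q1 = chi2 / (n (q - 1)) with q = min(r, c), Q2 = sqrt(chi2 / (n + chi2)), and the usual phi formula; the slide text is partly garbled, so these forms are a best reading rather than a confirmed transcription.

```python
from math import sqrt

def chi2_stat(table):
    """Pearson X^2 with E_ij = (row total) * (column total) / n."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(table))
        for j in range(len(table[0]))
    )

def contingency_coefficients(table):
    """Return (Q1, Q2) under the assumed definitions stated above."""
    n = sum(sum(row) for row in table)
    q = min(len(table), len(table[0]))
    x2 = chi2_stat(table)
    return x2 / (n * (q - 1)), sqrt(x2 / (n + x2))

def phi_2x2(table):
    """Phi coefficient for a 2 x 2 table."""
    (a, b), (c, d) = table
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

tea = [[3, 1],
       [1, 3]]

x2 = chi2_stat(tea)
q1, q2 = contingency_coefficients(tea)
phi = phi_2x2(tea)

print(x2, q1, q2, phi)
```

For a 2x2 table, phi squared equals chi2 / n, which coincides with Q1 under the assumed definition because q - 1 = 1.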

Small-Sample Test of Independence

When $n$ is small, alternative methods use exact small-sample distributions rather than large-sample approximations.

Fisher's Exact Test for 2x2 Tables

We know that for Poisson sampling nothing is fixed, for multinomial sampling only $n$ is fixed, and for independent binomial sampling in the two rows only the row marginal totals are fixed. In any of these cases, under $H_0$: independence, conditioning on both sets of marginal totals yields the hypergeometric distribution

$$p(t) = P(n_{11} = t) = \frac{\dbinom{n_{1+}}{t}\dbinom{n_{2+}}{n_{+1} - t}}{\dbinom{n}{n_{+1}}},$$

where $n_{1+}$ and $n_{2+}$ are the row totals and $n_{+1}$ is the first column total. This formula expresses the distribution of $\{n_{ij}\}$ in terms of only $n_{11}$. Given the marginal totals, $n_{11}$ determines the other three cell counts.

For 2x2 tables, independence is equivalent to the odds ratio $\theta = 1$. To test $H_0: \theta = 1$, the P-value is the sum of certain hypergeometric probabilities. To illustrate, consider $H_a: \theta > 1$. For the given marginal totals, tables having larger $n_{11}$ have larger odds ratios and hence stronger evidence in favor of $H_a$. Thus, the P-value equals $P(n_{11} \ge t_0)$, where $t_0$ denotes the observed value of $n_{11}$. This test for 2x2 tables is called Fisher's exact test.

Fisher's Tea Drinker

Muriel Bristol, a colleague of Fisher's, claimed that when drinking tea she could distinguish whether milk or tea was added to the cup first (she preferred milk added first).

                 Guess Poured First
Poured First     Milk    Tea    Total
Milk             3       1      4
Tea              1       3      4
Total            4       4      8

Distinguishing the order of pouring better than with pure guessing corresponds to $\theta > 1$, reflecting a positive association between order of pouring and the prediction. We conduct Fisher's exact test of $H_0: \theta = 1$ against $H_a: \theta > 1$. The observed table, with $t_0 = 3$ correct choices of the cups having milk added first, has null probability

$$\frac{\dbinom{4}{3}\dbinom{4}{1}}{\dbinom{8}{4}} = 0.229.$$

The P-value is $P(n_{11} \ge 3) = 0.243$. This result does not establish an association between the actual order of pouring and her predictions. It is difficult to do so with such a small sample. According to Fisher's daughter (Box, 1978, p. 134), in reality Bristol did convince Fisher of her ability.
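The two probabilities quoted for the tea-tasting table can be reproduced from the hypergeometric formula. The following is a minimal standard-library sketch; the function and argument names are illustrative, not from the slides.

```python
from math import comb

def hypergeom_pmf(t, row1, row2, col1):
    """P(n11 = t) given row totals row1, row2 and first column total col1."""
    return comb(row1, t) * comb(row2, col1 - t) / comb(row1 + row2, col1)

def fisher_one_sided(n11, row1, row2, col1):
    """P-value for H_a: theta > 1, i.e. P(n11 >= observed value)."""
    t_max = min(row1, col1)  # largest n11 compatible with the margins
    return sum(hypergeom_pmf(t, row1, row2, col1) for t in range(n11, t_max + 1))

# Tea-tasting table: n11 = 3, both row totals 4, first column total 4.
p_obs = hypergeom_pmf(3, 4, 4, 4)       # 16/70
p_val = fisher_one_sided(3, 4, 4, 4)    # 16/70 + 1/70 = 17/70

print(round(p_obs, 3), round(p_val, 3))  # 0.229 0.243
```

Only two tables (n11 = 3 and n11 = 4) are at least as extreme as the observed one, which is why the P-value exceeds the observed-table probability by just 1/70.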