Categorical Data Analysis

References:
Alan Agresti, Categorical Data Analysis, Wiley-Interscience, New Jersey, 2002.
Bhattacharyya, G.K., Johnson, R.A., Statistical Concepts and Methods, Wiley, 1977.

Outline
Categorical Response Data
Distributions for Categorical Data
Pearson's Test for Goodness of Fit
Contingency Tables
Test of Homogeneity and Exact Test

Categorical Response Data
A categorical variable has a measurement scale consisting of a set of categories. For instance, political philosophy is often measured as liberal, moderate or conservative, and religious affiliation with the categories Protestant, Catholic, Muslim, Hindu, Buddhist, etc.

Nominal-Ordinal Scale Distinction
Categorical variables have two primary types of scales.
Nominal: variables having categories without a natural ordering. Examples:
Mode of transportation to work: automobile, bicycle, bus, walk
Favorite type of music: jazz, classical, rock, pop, dangdut, keroncong
Ordinal: many categorical variables do have ordered categories. Examples:
Size of automobile: subcompact, compact, midsize, large
Social class: upper, middle, lower
Political philosophy: liberal, moderate, conservative

An interval variable is one that does have numerical distances between any two values. For example, blood pressure level, functional life length of a TV set, length of prison term and annual income are interval variables.

The way that a variable is measured determines its classification. For example, education is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor's, master's and doctorate. It is interval when measured by number of years of education, using the integers 0, 1, 2, ...

A variable's measurement scale determines which statistical methods are appropriate. The measurement hierarchy, from high to low, is: interval, ordinal, nominal. Methods for ordinal variables cannot be used with nominal variables, since their categories have no meaningful ordering. It is usually best to apply methods appropriate for the actual scale.

Data Type
Quantitative (Numerical): Discrete or Continuous
Qualitative (Categorical): Discrete

Quantitative vs. Qualitative
Quantitative data: variables recorded in numbers that we use as numbers are called quantitative. Quantitative variables have measurement units. Examples: incomes, heights, weights, ages and counts.
Qualitative data: the numbers here are just labels and their values are arbitrary. They represent categories of the variables. We call such variables categorical. Examples: sex, area code, production group in a certain location.

Discrete vs. Continuous
Discrete data: the data are integers and usually come from a counting process. Examples: number of employees, number of rejected lots.
Continuous data: the data are usually on an interval scale; they are measurement data. Examples: temperature, heights, weights.

Discrete Data
Nominal: the rank of the data is not important. Example: production group, coded 1 = Group A, 2 = Group B, 3 = Group C.
Ordinal: the rank of the data is meaningful. Example: frequency of smoking, coded 1 = very often, 2 = often, 3 = rarely, 4 = never.

Distributions for Categorical Data
Binomial Distribution
Let $y_1, y_2, \ldots, y_n$ denote responses for $n$ independent and identical trials such that $P(Y_i = 1) = \pi$ and $P(Y_i = 0) = 1 - \pi$. Identical trials means that the probability of success, $\pi$, is the same for each trial. Independent trials means that the $\{Y_i\}$ are independent random variables. These are often called Bernoulli trials. The total number of successes, $Y = \sum_{i=1}^{n} Y_i$, has the binomial distribution with index $n$ and parameter $\pi$, denoted by bin($n$, $\pi$).

The probability mass function for the possible outcome $y$ for $Y$ is
$$p(y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y}, \quad y = 0, 1, 2, \ldots, n.$$
The binomial distribution for $Y = \sum_i Y_i$ has mean and variance
$$\mu = E(Y) = n\pi, \quad \sigma^2 = \mathrm{var}(Y) = n\pi(1 - \pi).$$
There is no guarantee that successive binary observations are independent or identical.
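The binomial mass function, mean and variance can be checked numerically. The following is a minimal sketch using only the Python standard library; the values n = 10 and pi = 0.3 are hypothetical and chosen purely for illustration.

```python
from math import comb

def binomial_pmf(y: int, n: int, pi: float) -> float:
    """P(Y = y) for Y ~ bin(n, pi)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

# Hypothetical index and success probability (not from the slides).
n, pi = 10, 0.3
pmf = [binomial_pmf(y, n, pi) for y in range(n + 1)]

# Mean and variance computed directly from the mass function.
mean = sum(y * p for y, p in enumerate(pmf))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

print(round(sum(pmf), 6))   # total probability
print(round(mean, 6))       # compare with n * pi
print(round(variance, 6))   # compare with n * pi * (1 - pi)
```

Up to floating-point rounding, the computed mean and variance agree with n*pi = 3 and n*pi*(1 - pi) = 2.1.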

Multinomial Distribution
Some trials have more than two possible outcomes. Suppose that each of $n$ independent, identical trials can have an outcome in any of $c$ categories. Let
$$y_{ij} = \begin{cases} 1 & \text{if trial } i \text{ has outcome in category } j \\ 0 & \text{otherwise.} \end{cases}$$
Then $y_i = (y_{i1}, y_{i2}, \ldots, y_{ic})$ represents a multinomial trial, with $\sum_j y_{ij} = 1$.

Let $n_j = \sum_i Y_{ij}$ denote the number of trials having outcome in category $j$. The counts $(n_1, n_2, \ldots, n_c)$ have the multinomial distribution. Let $\pi_j = P(Y_{ij} = 1)$ denote the probability of outcome in category $j$ for each trial. The multinomial probability mass function is
$$p(n_1, n_2, \ldots, n_{c-1}) = \frac{n!}{n_1!\, n_2! \cdots n_c!}\, \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c},$$
with
$$E(n_j) = n\pi_j, \quad \mathrm{var}(n_j) = n\pi_j(1 - \pi_j).$$
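As a rough illustration of the multinomial mass function, the sketch below evaluates n!/(n_1! ... n_c!) times the product of pi_j^(n_j) with the Python standard library. The counts and probabilities are hypothetical, invented only to show the arithmetic.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability of the category counts (n_1, ..., n_c)."""
    n = sum(counts)
    coef = factorial(n)
    for n_j in counts:
        coef //= factorial(n_j)   # exact integer division at every step
    return coef * prod(p**n_j for p, n_j in zip(probs, counts))

# Hypothetical example: n = 4 trials, c = 3 categories.
counts = (2, 1, 1)
probs = (0.5, 0.3, 0.2)
print(round(multinomial_pmf(counts, probs), 6))

# With c = 2 the formula reduces to the binomial mass function.
print(round(multinomial_pmf((3, 7), (0.3, 0.7)), 6))
```

For the first call the coefficient is 4!/(2! 1! 1!) = 12, so the probability is 12 x 0.25 x 0.3 x 0.2 = 0.18.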

Poisson Distribution
Sometimes, count data do not result from a fixed number of trials. There is no upper limit for $y$. Since $y$ must be a nonnegative integer, its distribution should place its mass on that range. The simplest such distribution is the Poisson. The Poisson mass function is
$$p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots, \qquad E(y) = \mathrm{var}(y) = \mu.$$
The distribution approaches normality as $\mu$ increases.
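The equality of the Poisson mean and variance can be checked numerically. This is a minimal sketch with the Python standard library; the value mu = 2.5 and the truncation point of 60 terms are assumptions made for illustration (the tail beyond 60 is negligible for such a small mu).

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson variable with mean mu."""
    return exp(-mu) * mu**y / factorial(y)

mu = 2.5          # hypothetical mean, not from the slides
support = range(60)  # truncated support; the remaining mass is negligible here

pmf = [poisson_pmf(y, mu) for y in support]
mean = sum(y * p for y, p in zip(support, pmf))
variance = sum((y - mean) ** 2 * p for y, p in zip(support, pmf))

print(round(sum(pmf), 6))
print(round(mean, 6), round(variance, 6))  # both should be close to mu
```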

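Pearson's Test for Goodness of Fit
Null hypothesis: $H_0: p_1 = p_{10}, \ldots, p_k = p_{k0}$
The Pearson $X^2$ test statistic, with $n_i$ the observed count in cell $i$ and $n = \sum_i n_i$:
$$X^2 = \sum_{i=1}^{k} \frac{(n_i - n p_{i0})^2}{n p_{i0}} = \sum_{\text{cells}} \frac{(O - E)^2}{E}$$
Distribution: $X^2$ is asymptotically chi-squared with df = $k - 1$.
Rejection region: $X^2 \ge \chi^2_{\alpha}$, where $\chi^2_{\alpha}$ is the upper $\alpha$ point of the $\chi^2$ distribution with df = $k - 1$.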
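A short sketch of the goodness-of-fit computation follows, using only the Python standard library. The counts (120 hypothetical rolls of a die, tested against equal cell probabilities) are invented for illustration; the critical value 11.070 is the upper 5% point of the chi-squared distribution with 5 df.

```python
# Hypothetical observed counts for the six faces of a die (illustration only).
observed = [18, 22, 16, 25, 19, 20]
null_probs = [1 / 6] * 6          # H0: p_i = p_i0 = 1/6 for every face

n = sum(observed)
expected = [n * p for p in null_probs]   # E_i = n * p_i0

# Pearson statistic: sum over cells of (O - E)^2 / E.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_5PCT_DF5 = 11.070   # upper 5% point of chi-squared with 5 df
print(round(x2, 3), df)
print("reject H0" if x2 >= CRITICAL_5PCT_DF5 else "do not reject H0")
```

Here every expected count is 20, so the statistic is (4 + 4 + 16 + 25 + 1 + 0)/20 = 2.5, well below the critical value.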

Contingency Table

Observed counts:

               B_1     B_2     ...   B_c     Row Total
A_1            n_11    n_12    ...   n_1c    n_10
A_2            n_21    n_22    ...   n_2c    n_20
...
A_r            n_r1    n_r2    ...   n_rc    n_r0
Column Total   n_01    n_02    ...   n_0c    n

Cell probabilities:

               B_1     B_2     ...   B_c     Row Total
A_1            p_11    p_12    ...   p_1c    p_10
A_2            p_21    p_22    ...   p_2c    p_20
...
A_r            p_r1    p_r2    ...   p_rc    p_r0
Column Total   p_01    p_02    ...   p_0c    1

$p_{ij} = P(A_i B_j)$: probability of the joint occurrence of $A_i$ and $B_j$
$p_{i0} = P(A_i)$: total probability in the $i$th row
$p_{0j} = P(B_j)$: total probability in the $j$th column

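The null hypothesis of independence, for all cells $(i, j)$:
$$H_0: p_{ij} = p_{i0}\, p_{0j}$$
Estimation:
$$\hat{p}_{i0} = \frac{n_{i0}}{n}, \quad \hat{p}_{0j} = \frac{n_{0j}}{n}, \quad \hat{p}_{ij} = \hat{p}_{i0}\, \hat{p}_{0j} = \frac{n_{i0}\, n_{0j}}{n^2}$$
Expectation:
$$E_{ij} = n \hat{p}_{ij} = \frac{n_{i0}\, n_{0j}}{n}$$
The test statistic then becomes
$$X^2 = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$
which has an approximate $\chi^2$ distribution with d.f. = $(r - 1)(c - 1)$.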
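The expected counts and the test statistic for independence can be computed as in the sketch below, which uses only the Python standard library. The 2x3 table is hypothetical (invented for illustration, not data from the slides); the critical value 5.991 is the upper 5% point of the chi-squared distribution with 2 df.

```python
# Hypothetical 2x3 table of observed counts n_ij (illustration only).
observed = [
    [20, 30, 50],
    [30, 30, 40],
]

row_totals = [sum(row) for row in observed]          # n_i0
col_totals = [sum(col) for col in zip(*observed)]    # n_0j
n = sum(row_totals)

# Expected counts under independence: E_ij = n_i0 * n_0j / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

x2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_5PCT_DF2 = 5.991   # upper 5% point of chi-squared with 2 df
print(expected)
print(round(x2, 3), df)
print("reject H0" if x2 >= CRITICAL_5PCT_DF2 else "do not reject H0")
```

The same arithmetic applies when the row totals are fixed in advance by the sampling design, since the expected counts again take the form n_i0 * n_0j / n.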

Test of Homogeneity
The $\chi^2$ test of independence is based on the sampling scheme in which a single random sample of size $n$ is classified with respect to two characteristics simultaneously. An alternative sampling scheme involves a division of the population into subpopulations or strata according to the categories of one characteristic. A random sample of a predetermined size is drawn from each stratum and classified into categories of the other characteristic.

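Contingency Table

Observed counts (the row totals are the predetermined sample sizes):

               B_1     B_2     ...   B_c     Row Total
A_1            n_11    n_12    ...   n_1c    n_10
A_2            n_21    n_22    ...   n_2c    n_20
...
A_r            n_r1    n_r2    ...   n_rc    n_r0
Column Total   n_01    n_02    ...   n_0c    n

Probabilities within each population:

               B_1     B_2     ...   B_c     Row Total
A_1            w_11    w_12    ...   w_1c    1
A_2            w_21    w_22    ...   w_2c    1
...
A_r            w_r1    w_r2    ...   w_rc    1

$w_{ij} = P(B_j \mid A_i)$: probability of $B_j$ within the population $A_i$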

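Test of Homogeneity
The null hypothesis of homogeneity:
$$H_0: w_{1j} = w_{2j} = \cdots = w_{rj} \quad \text{for every } j = 1, \ldots, c$$
Estimation:
$$\hat{w}_{1j} = \hat{w}_{2j} = \cdots = \hat{w}_{rj} = \frac{n_{0j}}{n}$$
Expectation:
$$E_{ij} = (\text{no. of } A_i \text{ sampled}) \times (\text{estimated prob. of } B_j \text{ within } A_i) = n_{i0}\, \hat{w}_{ij} = \frac{n_{i0}\, n_{0j}}{n}$$
The test statistic then becomes
$$X^2 = \sum_{\text{all } rc \text{ cells}} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},$$
which has an approximate $\chi^2$ distribution with d.f. = $(r - 1)(c - 1)$.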

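Measures of Association in a Contingency Table
Cramér's contingency coefficient:
$$Q_1 = \frac{\chi^2}{n(q - 1)}, \quad 0 \le Q_1 \le 1$$
Pearson's coefficient of mean square contingency:
$$Q_2 = \sqrt{\frac{\chi^2}{n + \chi^2}}, \quad 0 \le Q_2 \le \sqrt{\frac{q - 1}{q}}$$
Here $q = \min(r, c)$, the smaller of the number of rows and the number of columns.
Pearson's phi coefficient in a 2x2 table:
$$\phi = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{10}\, n_{20}\, n_{01}\, n_{02}}}, \quad -1 \le \phi \le 1$$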
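The three measures can be computed from a table of counts as in the sketch below (Python standard library only). It assumes the usual forms Q1 = chi2 / (n (q - 1)), Q2 = sqrt(chi2 / (n + chi2)) with q = min(r, c), and phi = (n11 n22 - n12 n21) / sqrt(n10 n20 n01 n02). The 2x2 counts are those of the tea-tasting table that appears later in these slides, reused here only as a small numerical example.

```python
from math import sqrt

# 2x2 counts taken from the tea-tasting table (used only as a small example).
table = [[3, 1], [1, 3]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-squared statistic with E_ij = n_i0 * n_0j / n.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)

q = min(len(row_totals), len(col_totals))   # assumed definition: q = min(r, c)
q1 = chi2 / (n * (q - 1))                   # Cramer's contingency coefficient
q2 = sqrt(chi2 / (n + chi2))                # mean square contingency
phi = (table[0][0] * table[1][1] - table[0][1] * table[1][0]) / sqrt(
    row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]
)

print(round(chi2, 4), round(q1, 4), round(q2, 4), round(phi, 4))
```

For a 2x2 table, phi squared equals chi2 / n, so here phi = 0.5 and Q1 = 0.25 agree with each other.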

Small-sample test of independence
When $n$ is small, alternative methods use exact small-sample distributions rather than large-sample approximations.
Fisher's Exact Test for 2x2 Tables
We know that for Poisson sampling nothing is fixed, for multinomial sampling only $n$ is fixed, and for independent binomial sampling in the two rows only the row marginal totals are fixed. In any of these cases, under $H_0$: independence, conditioning on both sets of marginal totals yields the hypergeometric distribution
$$p(t) = P(n_{11} = t) = \frac{\binom{n_{1+}}{t} \binom{n_{2+}}{n_{+1} - t}}{\binom{n}{n_{+1}}},$$
where $n_{1+}$ and $n_{2+}$ are the row totals and $n_{+1}$ is the first column total (written $n_{10}$, $n_{20}$ and $n_{01}$ in the contingency tables above). This formula expresses the distribution of $\{n_{ij}\}$ in terms of only $n_{11}$. Given the marginal totals, $n_{11}$ determines the other three cell counts.

For 2x2 tables, independence is equivalent to the odds ratio $\theta = 1$. To test $H_0: \theta = 1$, the P-value is the sum of certain hypergeometric probabilities. To illustrate, consider $H_a: \theta > 1$. For the given marginal totals, tables having larger $n_{11}$ have larger odds ratios and hence stronger evidence in favor of $H_a$. Thus, the P-value equals $P(n_{11} \ge t_0)$, where $t_0$ denotes the observed value of $n_{11}$. This test for 2x2 tables is called Fisher's exact test.

Fisher's Tea Drinker
Muriel Bristol, a colleague of Fisher's, claimed that when drinking tea she could distinguish whether milk or tea was added to the cup first (she preferred milk added first).

                 Guess Poured First
Poured First     Milk    Tea     Total
Milk             3       1       4
Tea              1       3       4
Total            4       4       8

Distinguishing the order of pouring better than with pure guessing corresponds to $\theta > 1$, reflecting a positive association between order of pouring and the prediction. We conduct Fisher's exact test of $H_0: \theta = 1$ against $H_a: \theta > 1$. The observed table, $t_0 = 3$ correct choices of the cups having milk added first, has null probability
$$\frac{\binom{4}{3} \binom{4}{1}}{\binom{8}{4}} = \frac{16}{70} = 0.229.$$
The only more extreme table, $n_{11} = 4$, has probability $1/70 = 0.014$, so the P-value is $P(n_{11} \ge 3) = 0.243$. This result does not establish an association between the actual order of pouring and her predictions. It is difficult to do so with such a small sample. According to Fisher's daughter (Box, 1978, p. 134), in reality Bristol did convince Fisher of her ability.
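The two probabilities quoted for the tea-tasting table can be reproduced with a few lines of Python (standard library only). This is a minimal sketch of the one-sided test against theta > 1; the function and variable names are illustrative choices, not part of the original slides.

```python
from math import comb

def hypergeom_pmf(t: int, row1: int, row2: int, col1: int) -> float:
    """P(n11 = t) given row totals row1, row2 and first column total col1."""
    return comb(row1, t) * comb(row2, col1 - t) / comb(row1 + row2, col1)

# Margins of the tea-tasting table: 4 milk-first cups, 4 tea-first cups,
# and 4 cups guessed as milk-first.
row1, row2, col1 = 4, 4, 4
t_obs = 3                      # observed n11
t_max = min(row1, col1)        # largest n11 compatible with the margins

p_observed = hypergeom_pmf(t_obs, row1, row2, col1)
p_value = sum(hypergeom_pmf(t, row1, row2, col1) for t in range(t_obs, t_max + 1))

print(round(p_observed, 3))   # 0.229
print(round(p_value, 3))      # 0.243
```

The P-value sums the probabilities of the observed table (16/70) and the single more extreme table with all four cups guessed correctly (1/70).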