STAC51: Categorical Data Analysis

Mahinda Samarakoon
January 28, 2016

Table of contents

1. Inference for Proportions

Common Steps in Statistical Studies

Statistical studies usually involve the following steps.

1. Clearly state the problem or question you are trying to answer.
2. Think about what kind of data will help you answer the question.
3. Decide on an appropriate statistical model for the data.
4. Collect data.
5. Clean the data (remove outliers, etc.) and examine it: data summaries, displays.
6. Use the data to estimate the parameters of the model.
7. Carry out appropriate tests that will answer your question. Sometimes you may have to reconsider the model, re-estimate the parameters and redo the tests.
8. Draw conclusions about your question.

Common Steps in Statistical Studies: A simple example

We want to know if a coin is fair, i.e. $P(H) = 0.5$. (This is our question.)

- Data: toss the coin many times and observe the outcomes.
- Model for the data: a Bernoulli model? What is the parameter? $\pi$.
- Data collection: decide on $n$, toss the coin $n$ times and record the outcomes.
- Parameter estimation: use the data to estimate the parameter.
- Appropriate hypotheses: $H_0: \pi = 0.5$ against $H_a: \pi \neq 0.5$. Use the data (and maybe the parameter estimates above) to do the tests.
- Draw conclusions: does the test above indicate that the coin is not fair?

Inference for Proportions

Let $Y$ be the number of successes (i.e. 1s) in $n$ independent Bernoulli trials with success probability $\pi$. The probability of a success, $\pi$, is usually an unknown parameter, and we estimate it by the sample proportion of successes: $\hat\pi = Y/n$.

Some properties of $\hat\pi$:

1. $\hat\pi$ is an unbiased estimator of $\pi$ (i.e. $E(\hat\pi) = \pi$).
2. $\mathrm{Var}(\hat\pi) = \pi(1-\pi)/n$.
3. $\hat\pi \xrightarrow{\;Pr\;} \pi$ by the WLLN.
4. $\hat\pi \overset{\text{approx}}{\sim} N\!\left(\pi, \pi(1-\pi)/n\right)$ for large $n$, by the CLT.
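Properties 1 and 2 can be checked exactly for a small case by summing over the binomial probability mass function. The sketch below is only an illustration; the values `n = 20` and `pi_true = 0.3` are arbitrary assumptions, not taken from the slides.

```r
# Exact check of E(pi_hat) = pi and Var(pi_hat) = pi(1 - pi)/n
# for Y ~ Bin(n, pi). n and pi_true are arbitrary illustrative values.
n <- 20
pi_true <- 0.3

y <- 0:n
prob <- dbinom(y, size = n, prob = pi_true)   # P(Y = y) for every possible y
pi_hat <- y / n                               # possible values of the estimator

mean_pi_hat <- sum(pi_hat * prob)                     # E(pi_hat)
var_pi_hat  <- sum((pi_hat - mean_pi_hat)^2 * prob)   # Var(pi_hat)

mean_pi_hat   # equals pi_true up to rounding error
var_pi_hat    # equals pi_true * (1 - pi_true) / n up to rounding error
```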

Estimation of $\pi$: Likelihood function

Definition (page 9, text): The likelihood function is the probability of the observed data, expressed as a function of the parameter value.

Example: We toss a coin twice (i.e. $n = 2$) and observe one head (and one tail). $P(H) = \pi$, unknown. Find the likelihood function.

Answer: The number of heads when a coin is tossed twice has a $\mathrm{Bin}(n = 2, \pi)$ distribution, and so the likelihood function is

$$l(\pi) = \binom{2}{1}\pi^{1}(1-\pi)^{2-1} = 2\pi(1-\pi).$$

Estimation of $\pi$: Maximum Likelihood Estimator (MLE)

Definition (MLE): The maximum likelihood estimate (MLE) is the parameter value at which the likelihood function takes its maximum.

Example: We toss a coin twice (i.e. $n = 2$) and observe one head (and one tail). $P(H) = \pi$, unknown. Find the MLE.

Answer: The number of heads when a coin is tossed twice has a $\mathrm{Bin}(n = 2, \pi)$ distribution, and so the likelihood function is

$$l(\pi) = \binom{2}{1}\pi^{1}(1-\pi)^{2-1} = 2\pi(1-\pi).$$

$l(\pi)$ is maximized when $\pi = 0.5$, and so the MLE of $\pi$ is 0.5.
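As a rough numerical check of this example, the likelihood can be written with `dbinom()` and maximized with `optimize()`, both from base R. The name `lik` is a hypothetical helper chosen for this sketch.

```r
# Likelihood for y = 1 head in n = 2 tosses, written with the binomial pmf.
lik <- function(p) dbinom(1, size = 2, prob = p)   # equals 2 * p * (1 - p)

# Numerical maximization over (0, 1).
fit <- optimize(lik, interval = c(0, 1), maximum = TRUE)
mle <- fit$maximum

mle   # approximately 0.5, up to the tolerance of optimize()
```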

Some Properties of MLEs

- If $Y_1, Y_2, \ldots, Y_n$ are i.i.d. Normal (or many other distributions, such as Poisson), the ML estimate of the population mean is the sample mean, $\hat\mu = \bar{Y}$.
- In ordinary regression ($Y \sim$ Normal), least squares estimates are MLEs.
- For large $n$, MLEs have approximately normal sampling distributions (under weak conditions).

Example 2 (MLE)

A coin with $P(H) = \pi$ was tossed 20 times and 13 heads were observed. Find the likelihood function.

Answer: Dropping the constant $\binom{20}{13}$, which does not depend on $\pi$,

$$l(\pi) = \pi^{13}(1-\pi)^{20-13} = \pi^{13}(1-\pi)^{7}.$$

Plot the likelihood function and find the value of $\pi$ that maximizes $l(\pi)$.
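The maximizer can also be found analytically. A sketch of the standard calculus argument, assuming the likelihood $l(\pi) = \pi^{13}(1-\pi)^{7}$ given above:

```latex
\begin{align*}
\log l(\pi) &= 13\log\pi + 7\log(1-\pi),\\
\frac{d}{d\pi}\log l(\pi) &= \frac{13}{\pi} - \frac{7}{1-\pi} = 0
\;\Longrightarrow\; 13(1-\pi) = 7\pi
\;\Longrightarrow\; \hat\pi = \frac{13}{20} = 0.65 .
\end{align*}
```

The same steps with $y$ successes in $n$ trials give $\hat\pi = y/n$, the sample proportion.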

Example 2 (MLE): R code and output

> # R code for finding the MLE of pi where Y ~ Bin(20, pi)
> # and observed y = 13
> likelihood <- function(pi) { (pi^13)*((1-pi)^7) }
> curve(likelihood, from=0, to=1, xlab="pi", ylab="likelihood(pi)")
> optimize(likelihood, interval=c(0, 1), maximum=TRUE)
$maximum
[1] 0.6500009

$objective
[1] 2.378756e-06

> abline(v=(seq(0,1,by=0.02)), col="blue", lty="dotted")
> abline(h=(seq(0,2.5e-6,0.25e-6)), col="blue", lty="dotted")

Example 2 (MLE)

Figure: Likelihood function

Significance Tests for a binomial parameter (i.e. proportions)

Let $Y \sim \mathrm{Bin}(n, \pi)$. We are interested in testing the null hypothesis $H_0: \pi = \pi_0$.

Example: We toss a coin $n = 10$ times and observe $y = 3$ heads. $P(H) = \pi$. Test the null hypothesis $H_0: \pi = 0.5$ against $H_1: \pi < 0.5$.

Answer:

$$
\begin{aligned}
\text{p-value} &= P(Y=3) + P(Y=2) + P(Y=1) + P(Y=0)\\
&= \binom{10}{3}(0.5)^{3}(1-0.5)^{10-3} + \binom{10}{2}(0.5)^{2}(1-0.5)^{10-2}\\
&\quad + \binom{10}{1}(0.5)^{1}(1-0.5)^{10-1} + \binom{10}{0}(0.5)^{0}(1-0.5)^{10-0}\\
&= 0.171875.
\end{aligned}
$$

The p-value is greater than 0.05, and so we do not reject the null hypothesis.

Note: In this case, p-value $= P(Y \le y_{obs})$.
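The sum above can be reproduced in R in several equivalent ways. This is a minimal sketch using base R functions (`dbinom`, `pbinom`, `binom.test`); the variable names are chosen for illustration.

```r
# Lower-tail exact p-value for H0: pi = 0.5 against H1: pi < 0.5,
# with y = 3 heads in n = 10 tosses.
n <- 10
y_obs <- 3
pi0 <- 0.5

p_sum  <- sum(dbinom(0:y_obs, size = n, prob = pi0))   # P(Y=0) + ... + P(Y=3)
p_cdf  <- pbinom(y_obs, size = n, prob = pi0)          # same quantity via the CDF
p_test <- binom.test(y_obs, n, p = pi0, alternative = "less")$p.value

c(p_sum, p_cdf, p_test)   # all three agree: 0.171875
```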

Significance Tests for a binomial parameter (i.e. proportions)

Example: We toss a coin $n = 10$ times and observe $y = 8$ heads. $P(H) = \pi$. Test the null hypothesis $H_0: \pi = 0.6$ against $H_1: \pi > 0.6$.

Answer:

$$
\begin{aligned}
\text{p-value} &= P(Y=8) + P(Y=9) + P(Y=10)\\
&= \binom{10}{8}(0.6)^{8}(1-0.6)^{10-8} + \binom{10}{9}(0.6)^{9}(1-0.6)^{10-9}
 + \binom{10}{10}(0.6)^{10}(1-0.6)^{10-10}\\
&= 0.16728.
\end{aligned}
$$

The p-value is greater than 0.05, and so we do not reject the null hypothesis.

Note: In this case, p-value $= P(Y \ge y_{obs})$.

Significance Tests for a binomial parameter (i.e. proportions)

Example: We toss a coin $n = 10$ times and observe $y = 8$ heads. $P(H) = \pi$. Test the null hypothesis $H_0: \pi = 0.6$ against $H_1: \pi \neq 0.6$.

In this case we take p-value $= 2\min\left(P(Y \le y_{obs}),\, P(Y \ge y_{obs})\right)$.

In the previous example, we found that $P(Y \ge y_{obs}) = 0.16728$.

$$
\begin{aligned}
P(Y \le y_{obs}) &= P(Y=8) + P(Y=7) + \cdots + P(Y=0)\\
&= \binom{10}{8}(0.6)^{8}(1-0.6)^{10-8} + \binom{10}{7}(0.6)^{7}(1-0.6)^{10-7}
 + \cdots + \binom{10}{0}(0.6)^{0}(1-0.6)^{10-0}\\
&= 0.953642.
\end{aligned}
$$

p-value $= 2\min(0.953642,\, 0.16728) = 2 \times 0.16728 = 0.33456 > 0.05$, and so we do not reject the null hypothesis.

Significance Tests for a binomial parameter (i.e. proportions)

In the p-value for the two-tailed test, p-value $= 2\min\left(P(Y \le y_{obs}),\, P(Y \ge y_{obs})\right)$, we include $y_{obs}$ in both terms. This sometimes gives p-values greater than 1. In that case we take the p-value to be 1.

Example: We toss a coin $n = 10$ times and observe $y = 8$ heads. $P(H) = \pi$. Test the null hypothesis $H_0: \pi = 0.76$ against $H_1: \pi \neq 0.76$.

$$
\begin{aligned}
P(Y \le y_{obs}) &= P(Y=8) + P(Y=7) + \cdots + P(Y=0)\\
&= \binom{10}{8}(0.76)^{8}(1-0.76)^{10-8} + \binom{10}{7}(0.76)^{7}(1-0.76)^{10-7}
 + \cdots + \binom{10}{0}(0.76)^{0}(1-0.76)^{10-0}\\
&= 0.73269.
\end{aligned}
$$

Significance Tests for a binomial parameter (i.e. proportions)

$$
\begin{aligned}
P(Y \ge y_{obs}) &= P(Y=8) + P(Y=9) + P(Y=10)\\
&= \binom{10}{8}(0.76)^{8}(1-0.76)^{10-8} + \binom{10}{9}(0.76)^{9}(1-0.76)^{10-9}
 + \binom{10}{10}(0.76)^{10}(1-0.76)^{10-10}\\
&= 0.55580.
\end{aligned}
$$

$2\min(0.73269,\, 0.55580) = 2 \times 0.55580 = 1.1116$, which exceeds 1, so we take the p-value to be 1.
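The two-sided rule used in the last two examples, including the cap at 1, can be sketched as a small R function. The name `two_sided_p` is a hypothetical helper, not a built-in. Note that `binom.test()` uses a different convention for its two-sided p-value, so its output need not match this rule.

```r
# Two-sided exact p-value as defined on the slides:
# 2 * min(P(Y <= y_obs), P(Y >= y_obs)), capped at 1.
# two_sided_p is a hypothetical helper name.
two_sided_p <- function(y_obs, n, pi0) {
  lower <- pbinom(y_obs, size = n, prob = pi0)           # P(Y <= y_obs)
  upper <- 1 - pbinom(y_obs - 1, size = n, prob = pi0)   # P(Y >= y_obs)
  min(1, 2 * min(lower, upper))
}

two_sided_p(8, 10, 0.60)   # about 0.3346
two_sided_p(8, 10, 0.76)   # 2 * 0.5558 exceeds 1, so the result is 1
```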

Large sample tests

For testing the null hypothesis $H_0: \pi = \pi_0$, we can use the test statistic

$$Z = \frac{\hat\pi - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \qquad (1)$$

Under the null hypothesis and for a large enough sample size $n$, $Z \overset{\text{approx}}{\sim} N(0, 1)$. This result can be used to calculate the p-value for the test of $H_0: \pi = \pi_0$.

A large-sample $100(1-\alpha)$ percent confidence interval for $\pi$ is given by $\hat\pi \pm z_{\alpha/2}\,SE$, where $SE = \sqrt{\hat\pi(1-\hat\pi)/n}$.

Example (Agresti): When the 2000 General Social Survey asked subjects whether they would be willing to accept cuts in their standard of living to protect the environment, 344 of 1170 subjects said yes.

a) Estimate the population proportion who would say yes.
b) Conduct a significance test to determine whether a majority or minority of the population would say yes. Report and interpret the p-value.

The R code below calculates the value of the test statistic, the p-value and the sample proportion (the confidence interval is worked out further below).

> # R code for the z-test for a single proportion
> y <- 344
> n <- 1170
> p0 <- 0.5
> alpha <- 0.01
> phat <- y/n
> z <- (phat-p0)/sqrt((p0*(1-p0))/n)
> p_value = 2*(1 - pnorm(abs(z)))
> z
[1] -14.0914
> p_value
[1] 0
> phat
[1] 0.2940171

The R command prop.test will also produce the calculations required to answer these questions. The command help(prop.test) shows the details of the command. R reports a chi-square statistic, which is equivalent to the Z-test: X-squared is the square of z.

> res <- prop.test(x=344, n=1170, conf.level=0.99, correct=FALSE, p=0.5)
> res

        1-sample proportions test without continuity correction

data:  344 out of 1170, null probability 0.5
X-squared = 198.5675, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
        p
0.2940171

Confidence intervals for Proportions

For large samples we have used the formula $\hat\pi \pm z_{\alpha/2}\,SE$, where $SE = \sqrt{\hat\pi(1-\hat\pi)/n}$, for approximate confidence intervals.

In the above example, $\hat\pi = 344/1170 = 0.294$,

$$SE = \sqrt{\frac{\hat\pi(1-\hat\pi)}{n}} = \sqrt{\frac{0.294(1-0.294)}{1170}} = 0.013319,$$

and the 99% confidence interval ($z_{0.005} = 2.575$) is $0.294 \pm 2.575 \times 0.013319 = (0.2597081,\, 0.3283261)$, where the endpoints are computed from unrounded values.

Confidence intervals for Proportions

For large samples we have used the formula $\hat\pi \pm z_{\alpha/2}\,SE$, where $SE = \sqrt{\hat\pi(1-\hat\pi)/n}$, for approximate confidence intervals.

The above confidence interval, known as Wald's confidence interval, is based on the approximate Normal distribution of $\hat\pi$. For large enough $n$,

$$Z = \frac{\hat\pi - \pi}{\sqrt{\pi(1-\pi)/n}} \overset{\text{approx}}{\sim} N(0, 1)$$

and so

$$P\left(\hat\pi - z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}} < \pi < \hat\pi + z_{\alpha/2}\sqrt{\frac{\pi(1-\pi)}{n}}\right) \approx 1 - \alpha.$$

Wald's method replaces $\pi$ by $\hat\pi$ in the standard deviation (i.e. $\sqrt{\pi(1-\pi)/n}$ becomes $\sqrt{\hat\pi(1-\hat\pi)/n}$) to get $\hat\pi \pm z_{\alpha/2}\sqrt{\hat\pi(1-\hat\pi)/n}$.

The Wald CI often performs poorly in categorical data analysis unless $n$ is quite large.

Example: For $n = 25$, $y = 0$, we have $\hat\pi = 0$ and the Wald confidence interval is $(0, 0)$.
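A minimal R sketch of the Wald interval, applied to the General Social Survey example (344 of 1170, 99% level) and to the degenerate case $n = 25$, $y = 0$. The function name `wald_ci` and its arguments are assumptions made for this illustration.

```r
# Wald confidence interval for a proportion (wald_ci is a hypothetical helper name).
wald_ci <- function(y, n, conf_level = 0.95) {
  pi_hat <- y / n
  z  <- qnorm(1 - (1 - conf_level) / 2)    # z_{alpha/2}
  se <- sqrt(pi_hat * (1 - pi_hat) / n)    # estimated standard error
  c(lower = pi_hat - z * se, upper = pi_hat + z * se)
}

wald_ci(344, 1170, conf_level = 0.99)   # about (0.2597, 0.3283)
wald_ci(0, 25)                          # degenerate interval (0, 0)
```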

Score Confidence intervals for Proportions (Wilson score confidence interval)

In the score confidence interval (Wilson's score method), we collect the values of $\pi$ such that

$$|Z| = \left|\frac{\hat\pi - \pi}{\sqrt{\pi(1-\pi)/n}}\right| \le z_{\alpha/2}.$$

This method does not replace $\pi$ by $\hat\pi$ as in the Wald confidence interval. We find the upper and lower limits of the confidence interval by solving

$$\frac{\hat\pi - \pi}{\sqrt{\pi(1-\pi)/n}} = \pm z_{\alpha/2}.$$

This interval is given by

$$\hat\pi\left(\frac{n}{n+z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n+z_{\alpha/2}^2}\right)
\pm z_{\alpha/2}\sqrt{\frac{1}{n+z_{\alpha/2}^2}\left[\hat\pi(1-\hat\pi)\left(\frac{n}{n+z_{\alpha/2}^2}\right) + \frac{1}{4}\left(\frac{z_{\alpha/2}^2}{n+z_{\alpha/2}^2}\right)\right]}.$$

Score Confidence intervals for Proportions (Wilson score confidence interval)

Note 1: The score confidence interval can also be interpreted as the set of values of $\pi_0$ for which the p-value for testing the null hypothesis $H_0: \pi = \pi_0$ against the two-sided alternative $H_1: \pi \neq \pi_0$, using the test statistic

$$Z = \frac{\hat\pi - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}},$$

is greater than $\alpha$.
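This test-inversion view can be illustrated numerically: solve $Z = \pm z_{\alpha/2}$ for $\pi_0$ with base R's `uniroot()` and confirm that the two-sided p-value at each limit equals $\alpha$. The data values `n = 25`, `y = 5` are arbitrary assumptions for this sketch, not from the slides.

```r
# Score interval by test inversion: find pi0 where |Z| = z_{alpha/2}.
# n = 25 and y = 5 are arbitrary illustrative values.
n <- 25
y <- 5
alpha <- 0.05
pi_hat <- y / n
z_crit <- qnorm(1 - alpha / 2)

# Z statistic as a function of the hypothesized value pi0
z_stat <- function(pi0) (pi_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Z decreases in pi0: Z = +z_crit gives the lower limit, Z = -z_crit the upper limit.
lower <- uniroot(function(p) z_stat(p) - z_crit, c(1e-8, pi_hat), tol = 1e-10)$root
upper <- uniroot(function(p) z_stat(p) + z_crit, c(pi_hat, 1 - 1e-8), tol = 1e-10)$root

# Two-sided p-value of the z-test; at the limits it equals alpha
p_value <- function(pi0) 2 * (1 - pnorm(abs(z_stat(pi0))))
c(lower = lower, upper = upper, p_lower = p_value(lower), p_upper = p_value(upper))
```

Values of $\pi_0$ strictly between the two roots give p-values above $\alpha$; values outside give p-values below $\alpha$.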

Score Confidence intervals for Proportions (Wilson score confidence interval)

Note 2: The midpoint of the above interval is

$$\hat\pi\left(\frac{n}{n+z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n+z_{\alpha/2}^2}\right) = \frac{y + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2},$$

and for $\alpha = 0.05$,

$$\frac{y + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2} = \frac{y + 1.96^2/2}{n + 1.96^2} \approx \frac{y+2}{n+4}.$$

For this reason some authors consider $\dfrac{y + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2}$ an improved estimate of $\pi$.

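Score Confidence intervals for Proportions: Example

For $n = 25$, $y = 0$, $\hat\pi = 0$, the Wald confidence interval was $(0, 0)$. The score interval is

$$
\begin{aligned}
&\hat\pi\left(\frac{n}{n+z_{\alpha/2}^2}\right) + \frac{1}{2}\left(\frac{z_{\alpha/2}^2}{n+z_{\alpha/2}^2}\right)
\pm z_{\alpha/2}\sqrt{\frac{1}{n+z_{\alpha/2}^2}\left[\hat\pi(1-\hat\pi)\left(\frac{n}{n+z_{\alpha/2}^2}\right) + \frac{1}{4}\left(\frac{z_{\alpha/2}^2}{n+z_{\alpha/2}^2}\right)\right]}\\
&\quad= \left[0 + \frac{1}{2}\left(\frac{1.96^2}{25+1.96^2}\right)\right]
\pm 1.96\sqrt{\frac{1}{25+1.96^2}\left[0 + \frac{1}{4}\left(\frac{1.96^2}{25+1.96^2}\right)\right]}\\
&\quad= (0,\ 0.133).
\end{aligned}
$$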
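The closed-form score interval can be sketched in R as follows. The function name `wilson_ci` is a hypothetical helper; the formula is the one stated for the score interval, with `w` denoting $n/(n + z_{\alpha/2}^2)$.

```r
# Wilson score interval from the closed-form expression
# (wilson_ci is a hypothetical helper name).
wilson_ci <- function(y, n, conf_level = 0.95) {
  pi_hat <- y / n
  z <- qnorm(1 - (1 - conf_level) / 2)   # z_{alpha/2}
  w <- n / (n + z^2)                     # weight on pi_hat; 1 - w is the weight on 1/2
  mid  <- pi_hat * w + 0.5 * (1 - w)     # midpoint of the interval
  half <- z * sqrt((1 / (n + z^2)) * (pi_hat * (1 - pi_hat) * w + 0.25 * (1 - w)))
  c(lower = mid - half, upper = mid + half)
}

wilson_ci(0, 25)   # about (0, 0.133)
```

Unlike the Wald interval, the result for $y = 0$ has positive width.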

Agresti and Coull Confidence interval

In the score interval, we saw that $\tilde\pi = \dfrac{y + z_{\alpha/2}^2/2}{n + z_{\alpha/2}^2}$ is a better estimate of $\pi$ than $\hat\pi$.

Agresti and Coull (1998) suggest replacing $\hat\pi$ in the Wald confidence interval by $\tilde\pi$ to get

$$\tilde\pi \pm z_{\alpha/2}\sqrt{\frac{\tilde\pi(1-\tilde\pi)}{\tilde n}}, \qquad \text{where } \tilde n = n + z_{\alpha/2}^2.$$

This interval is called the Agresti-Coull confidence interval.
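A minimal R sketch of this interval, applied to the $n = 25$, $y = 0$ case used earlier. The function name `agresti_coull_ci` is a hypothetical helper.

```r
# Agresti-Coull interval: a Wald-type interval centred at pi_tilde
# with effective sample size n_tilde (agresti_coull_ci is a hypothetical helper name).
agresti_coull_ci <- function(y, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # z_{alpha/2}
  n_tilde  <- n + z^2
  pi_tilde <- (y + z^2 / 2) / n_tilde
  half <- z * sqrt(pi_tilde * (1 - pi_tilde) / n_tilde)
  c(estimate = pi_tilde, lower = pi_tilde - half, upper = pi_tilde + half)
}

agresti_coull_ci(0, 25)   # estimate about 0.067, upper limit about 0.158
```

For $y = 0$ the lower limit computed this way is slightly negative; since $\pi$ cannot be negative, such a limit is usually reported as 0.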

Likelihood Ratio Test of $H_0: \pi = \pi_0$ against $H_1: \pi \neq \pi_0$

Let $Y \sim \mathrm{Bin}(n, \pi)$. Then the likelihood function is $l(\pi) = \pi^{y}(1-\pi)^{n-y}$.

The likelihood ratio test of $H_0: \pi = \pi_0$ against $H_1: \pi \neq \pi_0$ rejects $H_0$ for small values of $\Lambda = l(\pi_0)/l(\hat\pi)$, i.e. if $\Lambda$ is smaller than some critical value.

Wilks (1938) showed that under the null hypothesis $H_0: \pi = \pi_0$, $-2\log\Lambda$ has a limiting chi-square distribution with 1 degree of freedom as $n \to \infty$.

Note: In this course we use the natural logarithm throughout.

We will use this limiting distribution in the likelihood ratio test and for calculating approximate confidence intervals based on the likelihood ratio.

Likelihood Ratio Test of $H_0: \pi = \pi_0$ against $H_1: \pi \neq \pi_0$

Note 1:

$$\Lambda = \frac{\text{maximum likelihood when } H_0 \text{ is true}}{\text{maximum likelihood with no restriction}}$$

Note 2: $-2\log\Lambda = -2\log\left(l(\pi_0)/l(\hat\pi)\right) = -2(L_0 - L_1)$, where $L_0 = \log l(\pi_0)$ and $L_1 = \log l(\hat\pi)$. We use log for natural logarithms.

The likelihood ratio test rejects $H_0$ if $-2\log\Lambda = -2(L_0 - L_1) > \chi^2_1(\alpha)$, where $\chi^2_1(\alpha)$ is the upper-$\alpha$ quantile (the $100(1-\alpha)$th percentile) of the chi-square distribution with 1 degree of freedom.

The likelihood ratio test statistic simplifies to

$$-2(L_0 - L_1) = 2\left[y\log\frac{\hat\pi}{\pi_0} + (n-y)\log\frac{1-\hat\pi}{1-\pi_0}\right].$$

This can also be expressed as

$$-2(L_0 - L_1) = 2\left[y\log\frac{y}{n\pi_0} + (n-y)\log\frac{n-y}{n-n\pi_0}\right]$$

and

$$-2(L_0 - L_1) = 2\sum \text{observed}\,\log\left(\frac{\text{observed}}{\text{fitted}}\right).$$
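The simplification follows directly from the binomial log-likelihood. A sketch of the intermediate steps, assuming $l(\pi) = \pi^{y}(1-\pi)^{n-y}$ and $\hat\pi = y/n$ as defined earlier:

```latex
\begin{align*}
L_0 &= y\log\pi_0 + (n-y)\log(1-\pi_0), \qquad
L_1 = y\log\hat\pi + (n-y)\log(1-\hat\pi),\\
-2(L_0 - L_1) &= 2\left[\,y\left(\log\hat\pi - \log\pi_0\right)
  + (n-y)\left(\log(1-\hat\pi) - \log(1-\pi_0)\right)\right]\\
&= 2\left[\,y\log\frac{\hat\pi}{\pi_0} + (n-y)\log\frac{1-\hat\pi}{1-\pi_0}\right]\\
&= 2\left[\,y\log\frac{y}{n\pi_0} + (n-y)\log\frac{n-y}{n-n\pi_0}\right].
\end{align*}
```

The last line substitutes $\hat\pi = y/n$ and $1-\hat\pi = (n-y)/n$. The observed counts are $y$ and $n-y$, and the fitted counts under $H_0$ are $n\pi_0$ and $n - n\pi_0$, which gives the "observed log(observed/fitted)" form.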

Likelihood Ratio Test of $H_0: \pi = \pi_0$ against $H_1: \pi \neq \pi_0$: Example

A coin was tossed 32 times and 23 heads were observed. Use the likelihood ratio test to test the null hypothesis $H_0: \pi = 0.5$ against $H_1: \pi \neq 0.5$.

Solution:

$$
\begin{aligned}
-2\log\Lambda = -2(L_0 - L_1) &= 2\sum \text{observed}\,\log\left(\frac{\text{observed}}{\text{fitted}}\right)\\
&= 2\left(23\log\left(\frac{23}{32\times 0.5}\right) + (32-23)\log\left(\frac{32-23}{32-32\times 0.5}\right)\right)\\
&= 6.337098101 > \chi^2_1(0.05) = 3.841,
\end{aligned}
$$

and so we reject the null hypothesis.
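The same calculation can be sketched in R using the observed and fitted counts. The variable names (`G2`, `crit`) are chosen for this illustration only.

```r
# Likelihood ratio test of H0: pi = 0.5 for 23 heads in 32 tosses.
n <- 32
y <- 23
pi0 <- 0.5

observed <- c(y, n - y)                  # heads, tails
fitted   <- c(n * pi0, n * (1 - pi0))    # expected counts under H0

G2   <- 2 * sum(observed * log(observed / fitted))   # -2 log(Lambda)
crit <- qchisq(0.95, df = 1)                         # upper 5% point, about 3.841
p_value <- 1 - pchisq(G2, df = 1)

G2          # about 6.337
G2 > crit   # TRUE, so H0 is rejected at the 5% level
p_value     # about 0.012
```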

Likelihood Ratio based Confidence intervals for $\pi$

The likelihood-based confidence interval for $\pi$ is the set of values of $\pi_0$ for which $-2\left(L(\pi_0) - L(\hat\pi)\right) < \chi^2_1(\alpha)$.

We can find the boundaries of the interval by solving the equation $-2\left(L(\pi_0) - L(\hat\pi)\right) = \chi^2_1(\alpha)$, or equivalently $-2\left(L(\pi_0) - L(\hat\pi)\right) - \chi^2_1(\alpha) = 0$.

This often requires a numerical solution.

Likelihood Ratio based Confidence intervals for $\pi$: Example

A coin was tossed 32 times and 23 heads were observed. Find a likelihood ratio test based 95% confidence interval for $\pi$.

Solution: We get the upper and lower limits of the likelihood ratio based confidence interval by solving $-2\left(L(\pi_0) - L(\hat\pi)\right) - \chi^2_1(\alpha) = 0$. Substituting values, the equation becomes:

$$-2\left[23\log(\pi_0) + (32-23)\log(1-\pi_0) - 23\log(23/32) - (32-23)\log\left(1-23/32\right)\right] - \chi^2_1(0.05) = 0.$$

Likelihood Ratio based Confidence intervals for $\pi$: Example

The R code and output below show the numerical solution to this equation and the likelihood ratio based confidence interval.

> # R code for Likelihood Ratio based Confidence interval
> # p. 12, Agresti
> library(rootSolve)
> n <- 32
> y <- 23
> phat <- y/n
> alpha <- 0.05
> f1 <- function(pi0) {
+   -2*(y*log(pi0) + (n-y)*log(1-pi0) - y*log(phat) - (n-y)*log(1-phat)) - qchisq(1-alpha, df=1)
+ }
> uniroot.all(f=f1, interval=c(0.000001, 0.999999))
[1] 0.5501852 0.8535842
> curve(f1, from=0, to=1, xlab="pi0", ylab="f1(pi0)")
> abline(h=0, col="red")

Likelihood Ratio based Confidence intervals for $\pi$: Example

> curve(f1, from=0, to=1, xlab="pi0", ylab="f1(pi0)")
> abline(h=0, col="red")
> abline(v=(seq(0,1,by=0.02)), col="blue", lty="dotted")
> abline(h=(seq(0,170,10)), col="blue", lty="dotted")

Likelihood Ratio based Confidence intervals for $\pi$: Example

Figure: Likelihood Ratio Confidence Interval
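If the rootSolve package is not available, the same limits can be approximated with base R's `uniroot()`, searching separately on each side of $\hat\pi$. This is an alternative sketch, not the method used on the slides; the names `loglik` and `g` are hypothetical helpers.

```r
# Likelihood ratio based 95% CI for pi with base R only
# (23 heads in 32 tosses). loglik and g are hypothetical helper names.
n <- 32
y <- 23
alpha <- 0.05
pi_hat <- y / n

loglik <- function(p) y * log(p) + (n - y) * log(1 - p)   # log-likelihood L(p)
g <- function(pi0) -2 * (loglik(pi0) - loglik(pi_hat)) - qchisq(1 - alpha, df = 1)

# g is negative at pi_hat and positive near 0 and 1, so there is
# one root on each side of pi_hat.
lower <- uniroot(g, c(1e-6, pi_hat), tol = 1e-9)$root
upper <- uniroot(g, c(pi_hat, 1 - 1e-6), tol = 1e-9)$root

c(lower = lower, upper = upper)   # about 0.550 and 0.854
```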