Topics on Statistics 3

Pejman Mahboubi

April 24

1 Contingency Tables

Assume we ask a sample of 1127 Americans whether they believe in an afterlife. The table below cross-classifies the sample by gender and response.

         YES  NO
FEMALE   509 116
MALE     398 104

Here, Gender and Response are two categorical variables. Gender has 2 levels, Male and Female; Response also has 2 levels, Yes and No. In general, with two categorical variables X with I levels and Y with J levels, we can build an I × J contingency table which displays the I × J possible combinations of the count outcomes. From the table, we can answer the following questions:

1. Joint probability of being male and not believing in an afterlife. If we decide to let the coordinates denote the response and gender respectively, then we can write

   P(No, Male) = 104/1127 ≈ 0.0923.    (1)

2. Conditional probability of not believing in an afterlife given (or conditioned on) the respondent being male. Since the total number of males is 502 = 398 + 104, and 104 of them don't believe in an afterlife,

   P(No | Male) = 104/502 ≈ 0.2072.    (2)

Conditional probability can be defined in terms of joint probability.

Definition 1 (Conditional Probability). For two events A and B with P(B) > 0,

   P(A | B) = P(A, B)/P(B),    (3)

which can also be written as

   P(A, B) = P(B) P(A | B).    (4)

Example 1. Using the definition of conditional probability, we have

   P(No | M) = P(No, M)/P(Gender = M) = (104/1127)/(502/1127) = 104/502.
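To make these computations concrete, here is a quick R sketch that rebuilds the table above and recovers the joint and conditional probabilities (the object name afterlife is arbitrary):

# Rebuild the afterlife contingency table from the counts above
afterlife <- matrix(c(509, 116, 398, 104), nrow = 2, byrow = TRUE,
                    dimnames = list(Gender = c("FEMALE", "MALE"),
                                    Response = c("YES", "NO")))
n <- sum(afterlife)                                  # 1127 respondents
afterlife["MALE", "NO"] / n                          # joint P(No, Male), ~ 0.0923
afterlife["MALE", "NO"] / sum(afterlife["MALE", ])   # conditional P(No | Male), ~ 0.2072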

2 Measures of Accuracy

Assume you train a classifier CL which predicts gender based on the subject's religiosity:

   CL(religious) = F,  CL(non-religious) = M.

Assume we apply this classifier to a test dataset of 20 individuals and get the following result:

> CL
   Gender Religiosity prdgender  class
1       F           Y         F  TRUE+
2       F           Y         F  TRUE+
3       F           Y         F  TRUE+
4       F           Y         F  TRUE+
5       F           N         M FALSE-
6       F           Y         F  TRUE+
7       F           Y         F  TRUE+
8       F           N         M FALSE-
9       F           N         M FALSE-
10      F           Y         F  TRUE+
11      F           Y         F  TRUE+
12      F           Y         F  TRUE+
13      M           N         M  TRUE-
14      M           Y         F FALSE+
15      M           N         M  TRUE-
16      M           Y         F FALSE+
17      M           N         M  TRUE-
18      M           Y         F FALSE+
19      M           Y         F FALSE+
20      M           Y         F FALSE+

Let's assume F and M represent the positive and negative classes respectively. For example, we predicted that the samples on rows 5, 8, 9, 13, 15, 17 are − and the rest are +. Then, by comparing our predictions with the gender (first column), we can tell which of our predictions were TRUE or FALSE. This way, we put our predictions into four categories: TRUE+ or TP, TRUE− or TN, FALSE− or FN, and FALSE+ or FP. The following cross-classification gives the number of our predictions in each class:

> (tbl <- table(CL[, c("Gender", "prdgender")]))
      prdgender
Gender F M
     F 9 3
     M 5 3

Therefore

   TP = 9,  FN = 3,  FP = 5,  TN = 3.

The positive class consists of the subjects who are correctly predicted positive or wrongly predicted negative, and similarly for the negative class:

   P = TP + FN,  N = TN + FP.

The diagonal elements are the counts of true positive and true negative predictions. One simple and highly intuitive measure of accuracy is

   accuracy = (TP + TN)/TOTAL = (9 + 3)/20 = 0.6.
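As a sketch, the four cells and the accuracy can be pulled out of the table object tbl built above:

# Extract the four prediction categories; F is the positive class
TP <- tbl["F", "F"]; FN <- tbl["F", "M"]   # true positives, false negatives
FP <- tbl["M", "F"]; TN <- tbl["M", "M"]   # false positives, true negatives
(TP + TN) / sum(tbl)                       # accuracy = (9 + 3)/20 = 0.6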

There is a major issue with this measure: sometimes it can fool us. Assume an endemic disease has infected 95% of a population. If we have a sample of 100 people and, with no testing at all, just predict all of them as infected, then we will have a high true positive count and 0 true negatives. The accuracy of this trivial model would be

   accuracy = 95/100 = 0.95.

2.1 Sensitivity and Specificity

In the picture below, you see a square divided into two rectangles. [Figure: the square contains all points in the dataset; the left and right rectangles denote the + and − classes; a circle marks the points the model predicts as +, so the left half-disk contains the true positives and the right half-disk the false positives.] There are different ways of measuring the accuracy of classifiers.

1. Sensitivity, Recall or True positive rate is the probability that a positive sample (left rectangle) is predicted as positive (in the disk):

   Sensitivity = (left half-disk)/(left rectangle).

2. Specificity is the probability that the diagnostic test shows negative, given that the subject does not have the disease:

   Specificity = (right rectangle − right half-disk)/(right rectangle).

In our example,

   sensitivity = 9/12 = 0.75.

Remark 1. We can also define the false positive rate as 1 − specificity.

Let's see an example in the context of testing for a disease. Assume a screening method for a rare disease has sensitivity .86 and specificity .88. This means that

   P(X = 1 | Y = 1) = .86  (1 denotes the positive class)
   P(X = 2 | Y = 2) = .88  (2 denotes the negative class)

where Y = 1 means the subject has the disease and X = 1 means the result of the test is positive. Furthermore, assume only 1% of the population is infected by this disease. A person takes the test and X = 1 (the test is positive). What is the probability that he has the disease, i.e., Y = 1?

Solution. Here we want to find P(Y = 1 | X = 1). By the Bayes rule,

   P(Y = 1 | X = 1) = P(X = 1 | Y = 1) P(Y = 1) / P(X = 1).

The numerator is readily computed as

   P(X = 1 | Y = 1) P(Y = 1) = .86 × .01 = .0086.

For the denominator we have

   P(X = 1) = P(X = 1 | Y = 1) P(Y = 1) + P(X = 1 | Y = 2) P(Y = 2) = .86 × .01 + .12 × .99 = .1274,

so the posterior is P(Y = 1 | X = 1) = .0086/.1274 ≈ .0675, which is still a very small number. Here, P(Y = 1) = .01 is called the (Bayesian) prior and P(Y = 1 | X = 1) is the posterior.
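The posterior computation is three lines of arithmetic; here is a minimal R sketch of it (variable names are ours):

# Bayes' rule for the rare-disease screening example
sens <- 0.86; spec <- 0.88; prior <- 0.01
num   <- sens * prior                    # P(X = 1, Y = 1)
denom <- num + (1 - spec) * (1 - prior)  # P(X = 1), by the law of total probability
num / denom                              # posterior P(Y = 1 | X = 1), ~ 0.0675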

Another way of measuring the accuracy of classifiers is by Precision and Recall:

   Precision = TP/(TP + FP),  Recall = TP/(TP + FN) = Sensitivity.

In our example, Recall = .75 and

   Precision = 9/(9 + 5) = 9/14 ≈ 0.64.    (5)

The function confusionMatrix() from library(caret) takes the prediction column and the true values and computes the contingency table, precision, recall, sensitivity, specificity and much more. You see that it generates a confidence interval for the accuracy; this is because the test dataset is a random sample.

> library(caret)
> confusionMatrix(CL$prdgender, CL$Gender)
Confusion Matrix and Statistics

          Reference
Prediction F M
         F 9 5
         M 3 3

               Accuracy : 0.6
                 95% CI : (0.3605, 0.8088)
    No Information Rate : 0.6
    P-Value [Acc > NIR] : 0.5956

                  Kappa : 0.1304
 Mcnemar's Test P-Value : 0.7237

            Sensitivity : 0.7500
            Specificity : 0.3750
         Pos Pred Value : 0.6429
         Neg Pred Value : 0.5000
             Prevalence : 0.6000
         Detection Rate : 0.4500
   Detection Prevalence : 0.7000
      Balanced Accuracy : 0.5625

       'Positive' Class : F

There is a trade-off between Precision and Recall in the sense that if we try to improve one of them in our model, the other will decrease.

3 Marginal Probabilities and Independence

Remember the result of the survey:

> table(df)

        Response
Gender    NO YES
  FEMALE 116 509
  MALE   104 398

We can normalize the table by dividing each cell by the total number of participants, i.e., 1127, to define a joint probability on the product space G × R as follows:

> prop.table(table(df))
        Response
Gender       NO    YES
  FEMALE 0.1029 0.4516
  MALE   0.0923 0.3531

This means that P gives probabilities to pairs of gender and response; for example, P(M, N) = .0923.

We can normalize the table in different ways. For example, if we divide the first row and the second row by their corresponding totals,

> (G.cond <- prop.table(table(df), 1))
        Response
Gender       NO    YES
  FEMALE 0.1856 0.8144
  MALE   0.2072 0.7928

we get conditional probabilities conditioned on Gender. The first row is conditioned on Gender = F and the second row on Gender = M, and we have

   P(N | F) = 0.1856,  P(Y | M) = 0.7928.

Similarly, we can condition on the response (the probabilities in each column add up to 1):

> (R.cond <- prop.table(table(df), 2))
        Response
Gender       NO    YES
  FEMALE 0.5273 0.5612
  MALE   0.4727 0.4388

   P(F | N) = 0.5273 = 1 − P(M | N).

The third way of normalizing is marginalizing. For example, the marginal probability of Gender is

> prop.table(table(df$Gender))
FEMALE   MALE 
0.5546 0.4454 

or, for Response,

> prop.table(table(df$Response))
    NO    YES 
0.1952 0.8048 

We can compute the marginal probabilities from the joint probabilities. For example,

   P(F) = P(F, Y) + P(F, N) = 0.4516 + 0.1029 = 0.5546,

because the events R = Y and R = N partition the sample space, i.e., every subject falls in one of these two sets and no subject falls in both events:

   P(Y ∪ N) = 1,  P(Y, N) = 0.    (6)

This is an example of the Law of Total Probability. To state this law formally, we need the definition of a partition.

Definition 2. A collection of events A_1, …, A_n forms a partition of the sample space if it satisfies the following two conditions:

1. They are mutually disjoint:

   P(A_i, A_j) = 0 for i ≠ j.    (7)

2. Together they cover the entire sample space:

   S = A_1 ∪ ⋯ ∪ A_n.    (8)

Theorem 1 (Law of Total Probability). Let A_1, …, A_n be a partition of a sample space. Then for any event B,

   P(B) = P(B, A_1) + ⋯ + P(B, A_n)  (sum of joint probabilities)    (9)
   P(B) = P(A_1) P(B | A_1) + ⋯ + P(A_n) P(B | A_n).    (10)

So the marginal probability of an event A is obtained by summing up all joint probabilities that have A as one of their inputs (margins). [Figure: A_1, A_2, A_3 form a partition of the sample space S; the event B is split into the pieces B_1 = B ∩ A_1, B_2 = B ∩ A_2, B_3 = B ∩ A_3.] Equation (10) is also referred to as the Law of Total Conditional Probability, which is readily derived from (9) using the identity P(B, A_n) = P(A_n) P(B | A_n); see the definition of conditional probability and (4). You can check that a conditional probability can be derived by dividing a joint probability by a marginal probability. For example, check that

   P(N | F) = P(N, F)/P(F) = 0.1029/0.5546 = 0.1856.

3.1 Independence

3.1.1 Independence of Two Events

Two events A and B are independent with respect to a probability P : S → [0, 1] if

   P(A, B) = P(A) P(B),

which is equivalent to

   P(A | B) = P(A).

We interpret the last one as: information about B doesn't change information about A.
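Reusing the afterlife matrix from the earlier sketch, both the marginalization and the independence check can be done numerically; this is an illustrative sketch, not code from the survey analysis itself:

joint <- prop.table(afterlife)       # joint probabilities
rowSums(joint)                       # P(F) ~ 0.5546, P(M) ~ 0.4454
colSums(joint)                       # P(YES) ~ 0.8048, P(NO) ~ 0.1952
joint["FEMALE", "YES"]                             # ~ 0.4516
rowSums(joint)["FEMALE"] * colSums(joint)["YES"]   # ~ 0.4463: close, but not equal

In the sample, the joint probability and the product of the marginals are close but not identical, which is exactly why Section 5.7 develops a formal test of independence.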

3.1.2 Independent Random Variables

Given two categorical variables X with I levels and Y with J levels, and joint probability P(X = i, Y = j), let's define the following notation:

   π_ij = P(X = i, Y = j),  i = 1, …, I and j = 1, …, J.    (11)

We also define notation for the marginals:

   π_i+ = Σ_j π_ij = Σ_j P(X = i, Y = j) = P(X = i)  for i = 1, …, I    (12)
   π_+j = Σ_i π_ij = Σ_i P(X = i, Y = j) = P(Y = j)  for j = 1, …, J.    (13)

X and Y are independent if, for any i ∈ {1, …, I} and j ∈ {1, …, J}, the joint probability of the events equals the product of the marginals:

   P(X = i, Y = j) = P(X = i) P(Y = j)    (14)

or, using the definition of conditional probability,

   P(X = i | Y = j) = P(X = i)  for all i and j,    (15)

i.e., the conditional probability equals the marginal probability!

Example 2. There are 100 blue, black and red balls in a jar. Each ball is either wooden or glass. The cross-classification is given below. Is color independent of type?

[Table: counts of the 100 balls, rows Glass and Wood, columns Blue, Black and Red; the numeric entries did not survive in this transcript.]

Solution. Dividing each count by 100 gives the joint probabilities, and summing rows and columns gives the marginals:

> (m1 <- apply(joint, 1, sum))
Glass  Wood 
> (m2 <- apply(joint, 2, sum))
 Blue Black   Red 

For every cell the product rule (14) holds, so color is independent of type.
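Since the jar's counts were lost above, here is a sketch with made-up counts that do satisfy independence exactly, to show what the check in Example 2 looks like in R (the numbers are illustrative only):

# Hypothetical jar: marginals (0.4, 0.6) for type and (0.2, 0.3, 0.5) for color
joint <- prop.table(matrix(c(8, 12, 20, 12, 18, 30), nrow = 2, byrow = TRUE,
                           dimnames = list(c("Glass", "Wood"),
                                           c("Blue", "Black", "Red"))))
m1 <- rowSums(joint); m2 <- colSums(joint)
all.equal(joint, outer(m1, m2), check.attributes = FALSE)   # TRUE: independent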

         YES  NO
FEMALE   509 116
MALE     398 104

Table 1: Cross-classifying contingency table

             NO    YES
FEMALE   0.1856 0.8144
MALE     0.2072 0.7928

Table 2: Conditional probabilities

4 Comparing Probabilities in 2 × 2 Contingency Tables

In our 2 × 2 contingency table, think of the levels of gender (Female, Male) as the explanatory random variable, or groups, that predict the response variable (Yes, No). Then we can think of p_1 = 0.81 and p_2 = 0.79 as the probabilities of success (Yes) in each group.

Remark 2. Here we tacitly assume that the numbers of males and females are fixed (non-random). Our analysis doesn't say what our best guess for the number of males and females would be if we repeated the sample. It only analyzes the range of the probabilities p_1 and p_2 in each group.

Let π_1 and π_2 denote the true rates of Response = YES in the female and male populations respectively. Then Remark 2 implies that the YES and NO responses in each group follow Bernoulli distributions with parameters π_1 and π_2, which are unknown to us.

Remark 3. If B is a Bernoulli random variable with parameter p, then mean(B) = p and Var(B) = p(1 − p).

We know the sample rates are p_1 = 0.81 and p_2 = 0.79. Can we compute a 95% confidence interval for π_1 − π_2? Coding the YES and NO responses as 1 and 0 respectively, the rate of success is p_1 = 509/625 ≈ 0.81 for the 625 female participants and p_2 = 398/502 ≈ 0.79 for the 502 male participants. Therefore, we can think of p_1 and p_2 as random sample means. By the Central Limit Theorem, we can assume they are sampled from two normal random variables, that is, p_1 and p_2 are distributed normally. More precisely,

   p_1 ∼ N(π_1, σ_1²),  p_2 ∼ N(π_2, σ_2²),    (16)

where σ_1² = π_1(1 − π_1)/625 and σ_2² = π_2(1 − π_2)/502. Since we don't have π_1 and π_2, we use p_1 and p_2 as approximations for π_1 and π_2, and write s_1² and s_2² instead of σ_1² and σ_2². Therefore we have

> p1 <- 0.81; p2 <- 0.79
> (s1 <- p1*(1-p1)/625)
[1] 0.00024624
> (s2 <- p2*(1-p2)/502)
[1] 0.00033048

Therefore

   p_1 ∼ N(0.81, 0.00024624),  p_2 ∼ N(0.79, 0.00033048).    (17)

Now let's discuss p_1 − p_2. But first a theorem! In the following theorem, pay attention that the variance is always the sum of the variances.

Theorem 2. If X_1 and X_2 are independent normal random variables with (mean, variance) parameters (m_1, σ_1²) and (m_2, σ_2²), then X_1 ± X_2 is normal with parameters (m_1 ± m_2, σ_1² + σ_2²).

Therefore, p_1 − p_2 is normally distributed with mean 0.81 − 0.79 = .02 and

   SE = √(s_1² + s_2²) = √(0.00024624 + 0.00033048) ≈ 0.024.    (18)

Therefore, the 95% confidence interval is

   [.02 − 1.96 × SE, .02 + 1.96 × SE] ≈ [−0.028, 0.068].
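The interval can also be checked against R's built-in two-proportion machinery; the sketch below uses the exact counts rather than the rounded 0.81 and 0.79, so its endpoints differ slightly from the figures above:

p1 <- 509/625; p2 <- 398/502
se <- sqrt(p1*(1 - p1)/625 + p2*(1 - p2)/502)
(p1 - p2) + c(-1.96, 1.96) * se                                 # ~ (-0.025, 0.068)
prop.test(c(509, 398), c(625, 502), correct = FALSE)$conf.int   # a very similar interval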

We can also perform hypothesis testing. Assume we want to check

   H_0 : π_1 = π_2  vs  H_1 : π_1 ≠ π_2.

Remark 4. π_1 = π_2 means P(R = Y | F) = P(R = Y | M). Cross-multiplying, P(Y, F) P(M) = P(Y, M) P(F). Since P(M) = 1 − P(F), this gives P(Y, F) − P(Y, F) P(F) = P(Y, M) P(F), which implies that P(Y, F) = [P(Y, M) + P(Y, F)] P(F) = P(Y) P(F). Therefore P(Y | F) = P(Y), which implies that P(N | F) = P(N). Similarly, we can check that P(Y | M) = P(Y), which implies that P(N | M) = P(N). Therefore, Response is independent of Gender.

Remember how we approximated σ_1² and σ_2² in (16) by s_1² and s_2². In hypothesis testing, under the null hypothesis π_1 = π_2, we can do a better job, thanks to Remark 4. The pooled variance is a common variance σ², closely related to the between variation in AOV, replacing both σ_1² and σ_2². If π_1 = π_2 then, as mentioned in Remark 4, Response is independent of Gender; therefore the two samples come from the same population, and σ_1² = σ_2² = σ². We estimate σ² by a weighted average of the sample variances:

   s_p² = [(n_1 − 1) s_1² + (n_2 − 1) s_2²]/(n_1 + n_2 − 2) ≈ 0.000284,

so s_p ≈ 0.0168; compare with (18). If π_1 = π_2, then p_1 − p_2 ∼ N(0, s_p²). The number we sampled is p_1 − p_2 = .02. What is the chance that N(0, s_p²) generates a number at distance .02 or more from the center 0?

> 2*pnorm(q = -.02, mean = 0, sd = 0.0168, lower.tail = TRUE)
[1] 0.234

Therefore, we cannot reject the null hypothesis.
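A more common variant of this pooled test pools the success counts rather than the variances; this is what prop.test does when the continuity correction is turned off. A sketch for comparison:

x <- c(509, 398); n <- c(625, 502)
p.pool <- sum(x)/sum(n)                       # pooled success rate under H0
z <- (x[1]/n[1] - x[2]/n[2]) /
     sqrt(p.pool * (1 - p.pool) * (1/n[1] + 1/n[2]))
2 * pnorm(-abs(z))                            # two-sided p-value, ~ 0.36

Either way the p-value is far above .05, so the conclusion is the same.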

5 Odds and Odds Ratio

If π is the rate of success in a binomial trial, then its corresponding odds are defined to be

   odds = π/(1 − π).    (19)

If odds = 4, then a success is 4 times as likely as a failure: we expect to see, on average, 4 successes for each failure. We can of course retrieve π from its odds via π = odds/(odds + 1).

Every 2 × 2 contingency table induces two rates of success, π_1 and π_2, corresponding to its rows. Let odds_1 and odds_2 be the odds corresponding to π_1 and π_2. Dividing odds_1 by odds_2, we find another measure of association between the rows. This measure, denoted by θ, is called the odds ratio and is defined by

   θ = odds_1/odds_2 = [π_1/(1 − π_1)] / [π_2/(1 − π_2)].    (20)

Odds ratios are positive numbers in the interval θ ∈ (0, ∞). θ = 4 means the odds of the group in the first row are 4 times the odds of the group in the second row; θ = 1/4 means the opposite, i.e., the odds of the group in the second row are 4 times the odds of the group in the first row; θ = 1 means the odds are equal, which implies that π_1 = π_2. In general, θ > 1 implies π_1 > π_2, θ = 1 implies π_1 = π_2, and θ < 1 implies π_1 < π_2. Furthermore, for any positive number α > 0, θ = α and θ = 1/α convey opposite implications about the odds of the two groups.

As always in statistics, we only have the sample odds ratio, defined by

   θ̂ = [p_1/(1 − p_1)] / [p_2/(1 − p_2)].

Consider two populations with equal odds. Then the sampling odds ratio will be around 1, with its left tail in (0, 1) and its right tail in (1, ∞); therefore, the sampling distribution of the odds ratio is highly skewed. But if we consider log θ instead of θ, we get nicer and more intuitive properties. For example,

1. log θ = 0 (i.e., θ = 1) implies π_1 = π_2;
2. log θ = 2 and log θ = −2 are symmetric around 0 and convey opposite statements about π_1 and π_2.

The sample log odds ratio, log θ̂, has a less skewed sampling distribution that is bell-shaped, with standard deviation given by

   SE = √(1/n_11 + 1/n_12 + 1/n_21 + 1/n_22),    (21)

where the n_ij are the counts in the contingency table.

Example 3. For our contingency table of the afterlife belief, compute log θ̂ and a 95% confidence interval for log θ.

Solution. The sample log θ̂ and standard deviation are

> odds.f <- 0.8144/(1-0.8144)
> odds.m <- 0.7928/(1-0.7928)
> (p <- log(odds.f/odds.m))
[1] 0.137
> (SE <- sqrt((1/509)+(1/116)+(1/398)+(1/104)))
[1] 0.1507

Then the lower and upper limits of the 95% CI are

> (lower <- p-1.96*SE)
[1] -0.1584
> (upper <- p+1.96*SE)
[1] 0.4324

Since zero is included in the interval, log θ = 0 is a possibility, so at the 95% confidence level we cannot rule out π_1 = π_2. By exponentiating, we find that the 95% CI for θ is

   [exp(−0.1584), exp(0.4324)] ≈ [0.8535, 1.5410].

5.1 Contingency Tables and Chi-Square Test

A 2000 General Social Survey cross-classifies 2757 subjects based on gender and political party as below:

       Democrat Independent Republican
Female      762         327        468
Male        484         239        477

This table defines a sample joint probability p = {p_11, p_12, p_13, p_21, p_22, p_23}, that is

       Democrat Independent Republican
Female   0.2764      0.1186     0.1698
Male     0.1756      0.0867     0.1730

Of course p is random, as it is defined by a random sample. Is there enough evidence to reject H_0, defined by

   H_0 : π = {0.25, 0.1, 0.25, 0.15, 0.1, 0.15}?

Solution. The expected count for each cell, μ = μ_ij, based on π = π_ij, is obtained by μ_ij = 2757 π_ij. So we have

       Democrat Independent Republican
Female   689.25      275.70     689.25
Male     413.55      275.70     413.55

There are 6 residuals, the differences between the expected (fitted) values and the sample (actual) values. The residual squares are

> sample <- c(762,327,468,484,239,477)
> expected <- c(689.25,275.70,689.25,413.55,275.70,413.55)
> res.sq <- c(sample-expected)^2

       Democrat Independent Republican
Female  5292.56     2631.69   48951.56
Male    4963.20     1346.89    4025.90

Bigger residuals are stronger evidence against H_0.

5.2 Chi-squared Distribution

The chi-squared distribution, also denoted the χ² distribution, with k degrees of freedom is the distribution of the sum of squares of k independent standard normal random variables. Think of the residuals in a contingency table: they are approximately normal, and after dividing by their standard deviations they become standard normal.

Definition 3 (Wikipedia). If Z_1, …, Z_k are independent, standard normal random variables, then the sum of their squares,

   Q = Σ_{i=1}^{k} Z_i²,    (22)

is distributed according to the chi-squared distribution with k degrees of freedom. This is usually denoted as Q ∼ χ²_k. Furthermore, if X ∼ χ²_k, then EX = k and Var X = 2k.
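The mean and variance claims in Definition 3 are easy to confirm by simulation; a minimal sketch:

# Draw 10000 values of Q = Z1^2 + ... + Z5^2 and check its mean and variance
set.seed(1)
z <- matrix(rnorm(10000 * 5), ncol = 5)   # 10000 sets of 5 iid N(0,1) draws
q <- rowSums(z^2)                         # 10000 draws from chi^2 with 5 df
mean(q); var(q)                           # close to 5 and 10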

[Figure: χ² densities for several degrees of freedom.] The graph of the densities shows how the χ² density becomes closer to a normal density as the degrees of freedom increase. In the discussion that comes next, we will talk about the mean and variance of the frequencies of each cell, not to be confused with the mean and variance of Q.

5.3 Simulating the Contingency Table

Assume we know the population probabilities of each cell: π = {π_11, π_12, π_13, π_21, π_22, π_23}. Then the counts of these 6 cells follow a multinomial distribution. For example, with 1000 people we might get

> set.seed(1001)
> pi <- c(.1,.3,.2,.1,.2,.1)
> r <- 14
> N <- 1000
> (sample <- rmultinom(n = r, size = N, prob = pi))

[Output: a 6 × 14 matrix of counts; the values did not survive in this transcript.]

Each column is a random sample for the 6 cells of the contingency table, and each row contains the 14 samples of one of the 6 cells. Check that each cell looks normal with mean = Nπ and sd = √(Nπ(1 − π)):

> (mean <- 1000*pi[1])
[1] 100
> (sd <- sqrt(1000*pi[1]*(1-pi[1])))
[1] 9.486833
> hist(sample[1,])

[Figure: histogram of sample[1, ].]
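A quick numerical check of this claim on the simulated tables (a sketch; with only r = 14 replicates the agreement is rough):

mean(sample[1, ]); sd(sample[1, ])   # should come out near 100 and 9.49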

So each cell cell_ij is binomial with parameter π_ij, with mean equal to Nπ_ij and variance Nπ_ij(1 − π_ij); see Theorem 3. Therefore, by the CLT,

   (O − Nπ_ij)/√(Nπ_ij(1 − π_ij)) ≈ N(0, 1).

5.4 A Two-Cell Model

Let's see what this yields when there are only two cells: one row and 2 columns, i.e., j = 1 and i = 1, 2. Let π_1 and π_2 denote the probabilities of cell 1 and cell 2, with π_1 + π_2 = 1. Furthermore, let

1. E_i = Nπ_i denote the mean (expected value) of cell i;
2. O_i denote the observation in cell i, so that O_1 + O_2 = N;
3. Q_i = (O_i − Nπ_i)/√(Nπ_i(1 − π_i)).

By the CLT, Q_1 is sampled from an approximately normal distribution, see Theorem 3. Writing π = π_1, the square Q_1² is

   Q_1² = (O_1 − E_1)²/(Nπ(1 − π))
        = (O_1 − Nπ)²/(Nπ) + (O_2 − N(1 − π))²/(N(1 − π))  (after some algebraic manipulation)
        = (O_1 − E_1)²/E_1 + (O_2 − E_2)²/E_2.

Therefore, in a two-cell model, if we compute (O_i − E_i)²/E_i for each cell and add them up, the result is a χ²_1 random variable.

Theorem 3. If X is a binomial random variable with parameters N and π, where N is the number of trials and π the rate of success, then

   EX = Nπ,  sd X = √(Nπ(1 − π)).    (23)

5.5 Six-Cell Model

The same holds when there are 6 cells. We want to test the null hypothesis, where H_0 and H_1 are given by

   H_0 : π = (π_1, …, π_6),  H_1 : π is not as given by H_0,    (24)

and we have observations O_1, …, O_6. Furthermore, the total number of observations N = O_1 + ⋯ + O_6 can be computed.

1. Compute E_i = Nπ_i, for i = 1, …, 6.
2. Compute (O_i − E_i)²/E_i for i = 1, …, 6.
3. Compute the test statistic Q = Σ_{i=1}^{6} (O_i − E_i)²/E_i.

Q is a number which, under H_0, is sampled from χ²_5: Q ∼ χ²_5. Check the table to see the chance (the p-value) that one would sample Q or bigger from χ²_5. If p_val < .05, you reject H_0.

5.6 Example Continued: Goodness of Fit

For our cross-classification of gender and political party, we computed all the summands. Adding them up, we get

> (chi.sq <- sum(res.sq/expected))
[1] 114.8674

This number is sampled from χ²_5. What is the chance that χ²_5 generates 114.87 or bigger?

> (p_val <- pchisq(q = chi.sq, df = 5, lower.tail = FALSE))
[1] 3.8e-23

And we reject H_0. This is an example of checking the goodness of fit using the chi-squared test. We are given observations, and our model is basically defined by π_1, …, π_6, or equivalently π_ij, i = 1, …, 3, j = 1, 2. In this case we concluded that the model defined by H_0 is not appropriate.

Example 4. Assume we are given 20 numbers, and we want to see whether it is acceptable to assume they are sampled from a normal distribution. [The 20 values are not reproduced in this transcript; the smallest is 88 and the largest is 120.]

> hist(t)

[Figure: histogram of t.]

Solution. If the numbers are from a normal distribution, then the mean and standard deviation would be

> (m <- mean(t))
[1] 102
> (sd <- sd(t))
[1] 10

The range starts from 88 and ends with the biggest number, 120. Let's make 3 cells: Cell 2 for all observations within one standard deviation of the mean, i.e., all observations in the interval Cell 2 = [102 − 10, 102 + 10] = [92, 112]; Cell 3 = (112, ∞); and Cell 1 = (−∞, 92]. If the numbers are from N(102, 10), then we can find the probability of each cell. From the familiar normal-density picture (about 68% of the mass lies within one standard deviation of the mean, and about 16% in each tail) we know that

> pi <- c(.16,.68,.16)
> o1 <- sum(t<(m-sd)); o3 <- sum(t>(m+sd)); o2 <- (length(t)-(o1+o3))
> (ob <- c(o1,o2,o3))

> (E <- pi*20)
[1]  3.2 13.6  3.2
> (res <- ((ob-E)^2)/E)
> (chi.sq <- sum(res))
[1] 0.588

There are 3 cells, therefore there are 2 degrees of freedom. Is χ²_2 = .588 too big?

> (p_val <- pchisq(q = chi.sq, df = 2, lower.tail = FALSE))
[1] 0.7453

No, we cannot reject the possibility that the numbers are sampled from N(102, 10).

5.7 Test of Independence

We dealt with independence in Example 2. What is different here? In Example 2 we had access to the entire population (the jar of balls) and could compute π_ij for each cell. Here we have a sample, and we need a more powerful theory to infer about the population probabilities. We cannot apply the definition of independence directly to the probabilities p derived from the contingency table, as they are at best estimates of π_ij and fluctuate from sample to sample.

5.7.1 Structure of H_0

In the χ² test of independence, H_0 is different from the H_0 for goodness of fit in (24). Here, instead of being given the joint probabilities π_ij, we are given the observations. We can then add up the observations in columns and rows to compute the marginals π_i+ and π_+j; see (12).

5.7.2 Degrees of Freedom

Furthermore, when testing goodness of fit, the only constraint on the joint probabilities π_ij is that

   Σ_ij π_ij = 1,    (25)

so there are IJ − 1 degrees of freedom. When testing independence, the marginals are computed from the observations. Using the marginals, we compute the joint probabilities under the independence condition, see (14). Therefore, instead of the single constraint Σ_ij π_ij = 1, the joint probabilities in each row and each column should add up to the first and second marginals respectively. Therefore, the degrees of freedom are (I − 1) × (J − 1).

5.7.3 How It Works

Assume there are I rows and J columns, so i ranges over 1, …, I and j over 1, …, J. We are given observations O_ij for all i, j. Therefore, we can compute the sample marginals p_i+ and p_+j. Use the sample marginals as approximations for π_i+ and π_+j. Use π_i+ and π_+j to compute the joint probabilities π_ij under the independence condition. Finally, test the hypothesis that the observations are consistent with these joint probabilities:

   H_0 : O_ij is sampled from π_ij for all i, j
   H_1 : O_ij is not sampled from π_ij for at least one pair i, j

Solution. We discuss the procedure step by step below. Remember that we are only given the observations O_ij.

1. Let O = Σ_ij O_ij denote the total number of observations.
2. Add the observations in each row and divide by O to obtain the row marginals π_1+, …, π_I+.
3. Add the observations in each column and divide by O to obtain the column marginals π_+1, …, π_+J.
4. Under independence, we can compute the joint probabilities π_ij = π_i+ π_+j.
5. Compute all the expected observations E_ij = π_ij O.
6. Compute (O_ij − E_ij)²/E_ij for all i, j.
7. Compute the test statistic

   Q = Σ_ij (O_ij − E_ij)²/E_ij.    (26)

8. Compute the p-value p_val, the right-tail probability beyond Q of the χ² distribution with (I − 1) × (J − 1) degrees of freedom.

Let me sample from the jar in Example 2. The sample is with replacement, so the size of the sample could be bigger than 100. Test the hypothesis

   H_0 : the color of a ball is independent of its type.

      Blue Black Red
Glass    2     6   8
Wood     7    14  13

To be able to use R, we store these numbers in a matrix:

> (a <- matrix(data = c(2,6,8,7,14,13), nrow = 2, byrow = TRUE,
+    dimnames = list(c("Glass","Wood"),c("Blue","Black","Red"))))
      Blue Black Red
Glass    2     6   8
Wood     7    14  13

1. The total number of observations is O = 2 + 6 + 8 + 7 + 14 + 13 = 50.
2. The first marginal is m.1 = [(2+6+8)/50, (7+14+13)/50]:

> (m.1 <- apply(X = a, MARGIN = 1, FUN = sum)/50)
Glass  Wood 
 0.32  0.68 

3. The second marginal:

> (m.2 <- apply(X = a, MARGIN = 2, FUN = sum)/50)
 Blue Black   Red 
 0.18  0.40  0.42 

4. The joint probabilities:

> (jp <- m.1 %o% m.2)
        Blue Black    Red
Glass 0.0576 0.128 0.1344
Wood  0.1224 0.272 0.2856

5. The expected observations:

> (E <- jp*50)
      Blue Black   Red
Glass 2.88   6.4  6.72
Wood  6.12  13.6 14.28

6. Compute the residuals squared, divided by the expected values:

> (R <- ((E-a)^2)/E)
        Blue    Black     Red
Glass 0.2689 0.025000 0.24381
Wood  0.1265 0.011765 0.11473

7. Compute Q:

> (chi.sq <- sum(R))
[1] 0.7907

8. The degrees of freedom are (2 − 1) × (3 − 1) = 2.
9. Compute the p-value:

> (p_val <- pchisq(q = chi.sq, df = 2, lower.tail = FALSE))
[1] 0.6734

We cannot reject independence. Or you can simply feed your data to the following command in R:

> chisq.test(a)

	Pearson's Chi-squared test

data:  a
X-squared = 0.79073, df = 2, p-value = 0.6734

6 ROC Curve and AUC

Assume an endemic disease has affected 10% of a population. We designed a classifier that generates a probability p of being in the positive class (diseased). We gather two groups of 500 healthy and 50 diseased patients and look at the distribution of scores that the classifier produces for each group. Assume we get the following results:

> H.score <- rnorm(500,.3,.15)
> S.score <- rnorm(50,.7,.1)

Let's look at the overlapping distributions of the two groups of scores in a plot. The vertical line is a threshold which encodes the prediction rule: scores on the right (bigger than the threshold) are predicted sick and scores on the left are predicted healthy. Then

1. Red on the left of the vertical line means: TRUE NEGATIVE

2. Red on the right of the vertical line means: FALSE POSITIVE
3. Green on the right of the vertical line means: TRUE POSITIVE
4. Green on the left of the vertical line means: FALSE NEGATIVE

We can place the vertical line at any x ∈ (0, 1) and compute the corresponding true positive rate and false positive rate. The resulting plot in false positive–true positive space is a curve. Assume from the data we know the positive class and the negative class (this is not based on prediction, but on the labels of the data). For example, H.score has 500 members; therefore we know that TN + FP = 500. Similarly, there are 50 sick people, which means TP + FN = 50. Before we proceed further to compute the rates, we will redefine S.score and H.score; first, the plot:

> library(ggplot2)
> dat <- data.frame(dens = c(H.score,S.score), lines = rep(c("Healthy","Sick"), c(500,50)))
> ggplot(dat, aes(dens, fill = lines)) + geom_histogram(position = "dodge") +
+   geom_vline(xintercept = .6)

[Figure: overlapping histograms of the Healthy and Sick scores, with a vertical threshold line at 0.6.]

> H.score <- rnorm(500,.4,.2)
> S.score <- rnorm(50,.6,.2)

Therefore, for a given threshold, say .6, we have

> threshold <- .6
> lp <- length(pclass <- S.score)
> ln <- length(nclass <- H.score)
> c(tpr = sum(pclass>threshold)/lp, fpr = sum(nclass>threshold)/ln)

where tpr and fpr stand for the true positive rate and the false positive rate respectively. We can compute tpr and fpr for different thresholds:

> ts <- seq(from = .01, to = .99, by = .01)
> K <- unname(unlist(lapply(X = ts, FUN = function(threshold)
+   c(tpr = sum(pclass>threshold)/length(pclass),
+     fpr = sum(nclass>threshold)/length(nclass)))))

Then we can separate tpr from fpr:

> tpr <- K[seq(from = 1, to = 197, by = 2)]
> fpr <- K[seq(from = 2, to = 198, by = 2)]

Then we can plot it:

> plot(fpr, tpr, type = 'l')

[Figure: the ROC curve, tpr plotted against fpr.]

The curve produced this way is called the ROC curve, and the area under the curve equals the probability that

the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example [From ...]. This fact allows us to compute the area under the curve by sampling and counting:

> p <- replicate(50000, sample(pclass, size=1) > sample(nclass, size=1))
> mean(p)

For these score distributions, the estimate comes out around 0.76. Let's repeat the process with a modified classifier which gives different (better) scores:

> H.score <- rnorm(500,.4,.15) -> nclass
> S.score <- rnorm(50,.6,.12) -> pclass
> ts <- seq(from = .01, to = .99, by = .01)
> K <- unname(unlist(lapply(X = ts, FUN = function(threshold)
+   c(tpr = sum(pclass>threshold)/length(pclass),
+     fpr = sum(nclass>threshold)/length(nclass)))))

Then we can separate tpr from fpr:

> tpr <- K[seq(from = 1, to = 197, by = 2)]
> fpr <- K[seq(from = 2, to = 198, by = 2)]

Then we can plot it:

> plot(fpr, tpr, type = 'l')

[Figure: the ROC curve of the modified classifier.]

> p <- replicate(50000, sample(pclass, size=1) > sample(nclass, size=1))
> mean(p)

This time the estimate is higher, around 0.85. You can play with the distributions of pclass and nclass and repeat the procedure above to see that, as the scores become more concentrated and more separated, the area under the curve increases.
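For two normal score distributions, the AUC also has a closed form, AUC = P(S > H) = Φ((μ_S − μ_H)/√(σ_S² + σ_H²)), which gives a handy sanity check on the sampling estimates above (a sketch):

# Closed-form AUC for independent normal scores: P(S > H)
pnorm((.6 - .4)/sqrt(.2^2 + .2^2))     # first classifier:  ~ 0.76
pnorm((.6 - .4)/sqrt(.15^2 + .12^2))   # second classifier: ~ 0.85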

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios ST3241 Categorical Data Analysis I Two-way Contingency Tables 2 2 Tables, Relative Risks and Odds Ratios 1 What Is A Contingency Table (p.16) Suppose X and Y are two categorical variables X has I categories

More information

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis

More information

Probability and Statistics. Terms and concepts

Probability and Statistics. Terms and concepts Probability and Statistics Joyeeta Dutta Moscato June 30, 2014 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution

More information

Performance evaluation of binary classifiers

Performance evaluation of binary classifiers Performance evaluation of binary classifiers Kevin P. Murphy Last updated October 10, 2007 1 ROC curves We frequently design systems to detect events of interest, such as diseases in patients, faces in

More information

n y π y (1 π) n y +ylogπ +(n y)log(1 π).

n y π y (1 π) n y +ylogπ +(n y)log(1 π). Tests for a binomial probability π Let Y bin(n,π). The likelihood is L(π) = n y π y (1 π) n y and the log-likelihood is L(π) = log n y +ylogπ +(n y)log(1 π). So L (π) = y π n y 1 π. 1 Solving for π gives

More information

Probability and Statistics. Joyeeta Dutta-Moscato June 29, 2015

Probability and Statistics. Joyeeta Dutta-Moscato June 29, 2015 Probability and Statistics Joyeeta Dutta-Moscato June 29, 2015 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution

More information

Performance Evaluation and Hypothesis Testing

Performance Evaluation and Hypothesis Testing Performance Evaluation and Hypothesis Testing 1 Motivation Evaluating the performance of learning systems is important because: Learning systems are usually designed to predict the class of future unlabeled

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Statistics 3858 : Contingency Tables

Statistics 3858 : Contingency Tables Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

13.1 Categorical Data and the Multinomial Experiment

13.1 Categorical Data and the Multinomial Experiment Chapter 13 Categorical Data Analysis 13.1 Categorical Data and the Multinomial Experiment Recall Variable: (numerical) variable (i.e. # of students, temperature, height,). (non-numerical, categorical)

More information

Introduction to Supervised Learning. Performance Evaluation

Introduction to Supervised Learning. Performance Evaluation Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

STAT 705: Analysis of Contingency Tables

STAT 705: Analysis of Contingency Tables STAT 705: Analysis of Contingency Tables Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Analysis of Contingency Tables 1 / 45 Outline of Part I: models and parameters Basic

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

STAC51: Categorical data Analysis

STAC51: Categorical data Analysis STAC51: Categorical data Analysis Mahinda Samarakoon January 26, 2016 Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 32 Table of contents Contingency Tables 1 Contingency Tables Mahinda Samarakoon

More information

Chapter 26: Comparing Counts (Chi Square)

Chapter 26: Comparing Counts (Chi Square) Chapter 6: Comparing Counts (Chi Square) We ve seen that you can turn a qualitative variable into a quantitative one (by counting the number of successes and failures), but that s a compromise it forces

More information

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I kevin small & byron wallace today a review of probability random variables, maximum likelihood, etc. crucial for clinical

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

How do we compare the relative performance among competing models?

How do we compare the relative performance among competing models? How do we compare the relative performance among competing models? 1 Comparing Data Mining Methods Frequent problem: we want to know which of the two learning techniques is better How to reliably say Model

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence

ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence ST3241 Categorical Data Analysis I Two-way Contingency Tables Odds Ratio and Tests of Independence 1 Inference For Odds Ratio (p. 24) For small to moderate sample size, the distribution of sample odds

More information

Chapter 2: Describing Contingency Tables - I

Chapter 2: Describing Contingency Tables - I : Describing Contingency Tables - I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu]

More information

STAT Chapter 13: Categorical Data. Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure).

STAT Chapter 13: Categorical Data. Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure). STAT 515 -- Chapter 13: Categorical Data Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure). Many studies allow for more than 2 categories. Example

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

MA : Introductory Probability

MA : Introductory Probability MA 320-001: Introductory Probability David Murrugarra Department of Mathematics, University of Kentucky http://www.math.uky.edu/~dmu228/ma320/ Spring 2017 David Murrugarra (University of Kentucky) MA 320:

More information

The Chi-Square Distributions

The Chi-Square Distributions MATH 183 The Chi-Square Distributions Dr. Neal, WKU The chi-square distributions can be used in statistics to analyze the standard deviation σ of a normally distributed measurement and to test the goodness

More information

The Multinomial Model

The Multinomial Model The Multinomial Model STA 312: Fall 2012 Contents 1 Multinomial Coefficients 1 2 Multinomial Distribution 2 3 Estimation 4 4 Hypothesis tests 8 5 Power 17 1 Multinomial Coefficients Multinomial coefficient

More information

Quantitative Analysis and Empirical Methods

Quantitative Analysis and Empirical Methods Hypothesis testing Sciences Po, Paris, CEE / LIEPP Introduction Hypotheses Procedure of hypothesis testing Two-tailed and one-tailed tests Statistical tests with categorical variables A hypothesis A testable

More information

Least Squares Classification

Least Squares Classification Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

Q1 (12 points): Chap 4 Exercise 3 (a) to (f) (2 points each)

Q1 (12 points): Chap 4 Exercise 3 (a) to (f) (2 points each) Q1 (1 points): Chap 4 Exercise 3 (a) to (f) ( points each) Given a table Table 1 Dataset for Exercise 3 Instance a 1 a a 3 Target Class 1 T T 1.0 + T T 6.0 + 3 T F 5.0-4 F F 4.0 + 5 F T 7.0-6 F T 3.0-7

More information

Smart Home Health Analytics Information Systems University of Maryland Baltimore County

Smart Home Health Analytics Information Systems University of Maryland Baltimore County Smart Home Health Analytics Information Systems University of Maryland Baltimore County 1 IEEE Expert, October 1996 2 Given sample S from all possible examples D Learner L learns hypothesis h based on

More information

STAT:5100 (22S:193) Statistical Inference I

STAT:5100 (22S:193) Statistical Inference I STAT:5100 (22S:193) Statistical Inference I Week 3 Luke Tierney University of Iowa Fall 2015 Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1 Recap Matching problem Generalized

More information

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval Epidemiology 9509 Wonders of Biostatistics Chapter 11 (continued) - probability in a single population John Koval Department of Epidemiology and Biostatistics University of Western Ontario What is being

More information

Performance Evaluation

Performance Evaluation Performance Evaluation Confusion Matrix: Detected Positive Negative Actual Positive A: True Positive B: False Negative Negative C: False Positive D: True Negative Recall or Sensitivity or True Positive

More information

Unit 9: Inferences for Proportions and Count Data

Unit 9: Inferences for Proportions and Count Data Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 1/15/008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

15: CHI SQUARED TESTS

15: CHI SQUARED TESTS 15: CHI SQUARED ESS MULIPLE CHOICE QUESIONS In the following multiple choice questions, please circle the correct answer. 1. Which statistical technique is appropriate when we describe a single population

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

Binary Logistic Regression

Binary Logistic Regression The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b

More information

Categorical Variables and Contingency Tables: Description and Inference

Categorical Variables and Contingency Tables: Description and Inference Categorical Variables and Contingency Tables: Description and Inference STAT 526 Professor Olga Vitek March 3, 2011 Reading: Agresti Ch. 1, 2 and 3 Faraway Ch. 4 3 Univariate Binomial and Multinomial Measurements

More information

Evaluation & Credibility Issues

Evaluation & Credibility Issues Evaluation & Credibility Issues What measure should we use? accuracy might not be enough. How reliable are the predicted results? How much should we believe in what was learned? Error on the training data

More information

Categorical Data Analysis Chapter 3

Categorical Data Analysis Chapter 3 Categorical Data Analysis Chapter 3 The actual coverage probability is usually a bit higher than the nominal level. Confidence intervals for association parameteres Consider the odds ratio in the 2x2 table,

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

PubH 5450 Biostatistics I Prof. Carlin. Lecture 13

PubH 5450 Biostatistics I Prof. Carlin. Lecture 13 PubH 5450 Biostatistics I Prof. Carlin Lecture 13 Outline Outline Sample Size Counts, Rates and Proportions Part I Sample Size Type I Error and Power Type I error rate: probability of rejecting the null

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score Interpret Standard Deviation Outlier Rule Linear Transformations Describe the Distribution OR Compare the Distributions SOCS Using Normalcdf and Invnorm (Calculator Tips) Interpret a z score What is an

More information

Lecture 8: Summary Measures

Lecture 8: Summary Measures Lecture 8: Summary Measures Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 8:

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

Probability Theory and Applications

Probability Theory and Applications Probability Theory and Applications Videos of the topics covered in this manual are available at the following links: Lesson 4 Probability I http://faculty.citadel.edu/silver/ba205/online course/lesson

More information

Lecture 5: ANOVA and Correlation

Lecture 5: ANOVA and Correlation Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62 Comparing Multiple Groups Continous data: comparing means Analysis of variance Binary data: comparing proportions

More information

Chapter 10. Discrete Data Analysis

Chapter 10. Discrete Data Analysis Chapter 1. Discrete Data Analysis 1.1 Inferences on a Population Proportion 1. Comparing Two Population Proportions 1.3 Goodness of Fit Tests for One-Way Contingency Tables 1.4 Testing for Independence

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Inference for Binomial Parameters

Inference for Binomial Parameters Inference for Binomial Parameters Dipankar Bandyopadhyay, Ph.D. Department of Biostatistics, Virginia Commonwealth University D. Bandyopadhyay (VCU) BIOS 625: Categorical Data & GLM 1 / 58 Inference for

More information

Two-sample Categorical data: Testing

Two-sample Categorical data: Testing Two-sample Categorical data: Testing Patrick Breheny April 1 Patrick Breheny Introduction to Biostatistics (171:161) 1/28 Separate vs. paired samples Despite the fact that paired samples usually offer

More information

16.400/453J Human Factors Engineering. Design of Experiments II

16.400/453J Human Factors Engineering. Design of Experiments II J Human Factors Engineering Design of Experiments II Review Experiment Design and Descriptive Statistics Research question, independent and dependent variables, histograms, box plots, etc. Inferential

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

An introduction to biostatistics: part 1

An introduction to biostatistics: part 1 An introduction to biostatistics: part 1 Cavan Reilly September 6, 2017 Table of contents Introduction to data analysis Uncertainty Probability Conditional probability Random variables Discrete random

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Performance Evaluation

Performance Evaluation Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

TA: Sheng Zhgang (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan (W 1:20) / 346 (Th 12:05) FINAL EXAM

TA: Sheng Zhgang (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan (W 1:20) / 346 (Th 12:05) FINAL EXAM STAT 301, Fall 2011 Name Lec 4: Ismor Fischer Discussion Section: Please circle one! TA: Sheng Zhgang... 341 (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan... 345 (W 1:20) / 346 (Th

More information

AP Statistics Cumulative AP Exam Study Guide

AP Statistics Cumulative AP Exam Study Guide AP Statistics Cumulative AP Eam Study Guide Chapters & 3 - Graphs Statistics the science of collecting, analyzing, and drawing conclusions from data. Descriptive methods of organizing and summarizing statistics

More information

STATISTICS 141 Final Review

STATISTICS 141 Final Review STATISTICS 141 Final Review Bin Zou bzou@ualberta.ca Department of Mathematical & Statistical Sciences University of Alberta Winter 2015 Bin Zou (bzou@ualberta.ca) STAT 141 Final Review Winter 2015 1 /

More information

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1 CptS 570 Machine Learning School of EECS Washington State University CptS 570 - Machine Learning 1 IEEE Expert, October 1996 CptS 570 - Machine Learning 2 Given sample S from all possible examples D Learner

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Goodness of Fit Tests

Goodness of Fit Tests Goodness of Fit Tests Marc H. Mehlman marcmehlman@yahoo.com University of New Haven (University of New Haven) Goodness of Fit Tests 1 / 38 Table of Contents 1 Goodness of Fit Chi Squared Test 2 Tests of

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Lecture 41 Sections Mon, Apr 7, 2008

Lecture 41 Sections Mon, Apr 7, 2008 Lecture 41 Sections 14.1-14.3 Hampden-Sydney College Mon, Apr 7, 2008 Outline 1 2 3 4 5 one-proportion test that we just studied allows us to test a hypothesis concerning one proportion, or two categories,

More information

Chapter 9 Inferences from Two Samples

Chapter 9 Inferences from Two Samples Chapter 9 Inferences from Two Samples 9-1 Review and Preview 9-2 Two Proportions 9-3 Two Means: Independent Samples 9-4 Two Dependent Samples (Matched Pairs) 9-5 Two Variances or Standard Deviations Review

More information

Unit 9: Inferences for Proportions and Count Data

Unit 9: Inferences for Proportions and Count Data Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 12/15/2008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)

More information

Chapter 10: Chi-Square and F Distributions

Chapter 10: Chi-Square and F Distributions Chapter 10: Chi-Square and F Distributions Chapter Notes 1 Chi-Square: Tests of Independence 2 4 & of Homogeneity 2 Chi-Square: Goodness of Fit 5 6 3 Testing & Estimating a Single Variance 7 10 or Standard

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Statistical methods for comparing multiple groups. Lecture 7: ANOVA. ANOVA: Definition. ANOVA: Concepts

Statistical methods for comparing multiple groups. Lecture 7: ANOVA. ANOVA: Definition. ANOVA: Concepts Statistical methods for comparing multiple groups Lecture 7: ANOVA Sandy Eckel seckel@jhsph.edu 30 April 2008 Continuous data: comparing multiple means Analysis of variance Binary data: comparing multiple

More information

Lecture 7: Hypothesis Testing and ANOVA

Lecture 7: Hypothesis Testing and ANOVA Lecture 7: Hypothesis Testing and ANOVA Goals Overview of key elements of hypothesis testing Review of common one and two sample tests Introduction to ANOVA Hypothesis Testing The intent of hypothesis

More information

Statistical methods in recognition. Why is classification a problem?

Statistical methods in recognition. Why is classification a problem? Statistical methods in recognition Basic steps in classifier design collect training images choose a classification model estimate parameters of classification model from training images evaluate model

More information

Probability: Why do we care? Lecture 2: Probability and Distributions. Classical Definition. What is Probability?

Probability: Why do we care? Lecture 2: Probability and Distributions. Classical Definition. What is Probability? Probability: Why do we care? Lecture 2: Probability and Distributions Sandy Eckel seckel@jhsph.edu 22 April 2008 Probability helps us by: Allowing us to translate scientific questions into mathematical

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information

Probability and Discrete Distributions

Probability and Discrete Distributions AMS 7L LAB #3 Fall, 2007 Objectives: Probability and Discrete Distributions 1. To explore relative frequency and the Law of Large Numbers 2. To practice the basic rules of probability 3. To work with the

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

3 PROBABILITY TOPICS

3 PROBABILITY TOPICS Chapter 3 Probability Topics 135 3 PROBABILITY TOPICS Figure 3.1 Meteor showers are rare, but the probability of them occurring can be calculated. (credit: Navicore/flickr) Introduction It is often necessary

More information