Topics on Statistics 3

Pejman Mahboubi

April 24

1 Contingency Tables

Assume we ask a sample of 1127 Americans whether they believe in an afterlife. The table below cross-classifies the sample by gender and response.

         YES  NO
FEMALE   509 116
MALE     398 104

Here, Gender and Response are two categorical variables. Gender has 2 levels, Male and Female; Response also has 2 levels, Yes and No. In general, with two categorical variables X with I levels and Y with J levels, we can build an I × J contingency table which displays the I × J possible combinations of the count outcomes. From the table, we can answer the following questions:

1. Joint probability of being male and not believing in an afterlife. If we decide to let the coordinates denote the response and gender respectively, then we can write

   P(No, Male) = 104/1127 ≈ 0.0923.    (1)

2. Conditional probability of not believing in an afterlife given (or conditioned on) the respondent being male. Since the total number of males is 502 = 398 + 104, and 104 of them don't believe in an afterlife,

   P(No | Male) = 104/502 ≈ 0.2072.    (2)

Conditional probability can be defined in terms of joint probability.

Definition 1 (Conditional Probability). For two events A and B with P(B) > 0,

   P(A | B) = P(A, B)/P(B),    (3)

which can also be written as

   P(A, B) = P(B) P(A | B).    (4)

Example 1. Using the definition of conditional probability, we have

   P(No | M) = P(No, M)/P(Gender = M) = (104/1127)/(502/1127) = 104/502.
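To make these computations concrete, here is a quick R sketch that rebuilds the table above and recovers the joint and conditional probabilities (the object name afterlife is arbitrary):

# Rebuild the afterlife contingency table from the counts above
afterlife <- matrix(c(509, 116, 398, 104), nrow = 2, byrow = TRUE,
                    dimnames = list(Gender = c("FEMALE", "MALE"),
                                    Response = c("YES", "NO")))
n <- sum(afterlife)                                  # 1127 respondents
afterlife["MALE", "NO"] / n                          # joint P(No, Male), ~ 0.0923
afterlife["MALE", "NO"] / sum(afterlife["MALE", ])   # conditional P(No | Male), ~ 0.2072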

2 Measures of Accuracy

Assume you train a classifier CL which predicts gender based on the subject's religiosity:

   CL(religious) = F,  CL(non-religious) = M.

Assume we apply this classifier to a test dataset of 20 individuals and get the following result:

> CL
   Gender Religiosity prdgender  class
1       F           Y         F  TRUE+
2       F           Y         F  TRUE+
3       F           Y         F  TRUE+
4       F           Y         F  TRUE+
5       F           N         M FALSE-
6       F           Y         F  TRUE+
7       F           Y         F  TRUE+
8       F           N         M FALSE-
9       F           N         M FALSE-
10      F           Y         F  TRUE+
11      F           Y         F  TRUE+
12      F           Y         F  TRUE+
13      M           N         M  TRUE-
14      M           Y         F FALSE+
15      M           N         M  TRUE-
16      M           Y         F FALSE+
17      M           N         M  TRUE-
18      M           Y         F FALSE+
19      M           Y         F FALSE+
20      M           Y         F FALSE+

Let's assume F and M represent the positive and negative classes respectively. For example, we predicted that the samples on rows 5, 8, 9, 13, 15, 17 are − and the rest are +. Then, by comparing our predictions with the gender (first column), we can tell which of our predictions were TRUE or FALSE. This way, we put our predictions into four categories: TRUE+ or TP, TRUE− or TN, FALSE− or FN, and FALSE+ or FP. The following cross-classification gives the number of our predictions in each class:

> (tbl <- table(CL[, c("Gender", "prdgender")]))
      prdgender
Gender F M
     F 9 3
     M 5 3

Therefore

   TP = 9,  FN = 3,  FP = 5,  TN = 3.

The positive class consists of the subjects who are correctly predicted positive or wrongly predicted negative, and similarly for the negative class:

   P = TP + FN,  N = TN + FP.

The diagonal elements are the counts of true positive and true negative predictions. One simple and highly intuitive measure of accuracy is

   accuracy = (TP + TN)/TOTAL = (9 + 3)/20 = 0.6.
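As a sketch, the four cells and the accuracy can be pulled out of the table object tbl built above:

# Extract the four prediction categories; F is the positive class
TP <- tbl["F", "F"]; FN <- tbl["F", "M"]   # true positives, false negatives
FP <- tbl["M", "F"]; TN <- tbl["M", "M"]   # false positives, true negatives
(TP + TN) / sum(tbl)                       # accuracy = (9 + 3)/20 = 0.6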

There is a major issue with this measure: sometimes it can fool us. Assume an endemic disease has infected 95% of a population. If we have a sample of 100 people and, with no testing at all, just predict all of them as infected, then we will have a high true positive count and 0 true negatives. The accuracy of this trivial model would be

   accuracy = 95/100 = 0.95.

2.1 Sensitivity and Specificity

In the picture below, you see a square divided into two rectangles. [Figure: the square contains all points in the dataset; the left and right rectangles denote the + and − classes; a circle marks the points the model predicts as +, so the left half-disk contains the true positives and the right half-disk the false positives.] There are different ways of measuring the accuracy of classifiers.

1. Sensitivity, Recall or True positive rate is the probability that a positive sample (left rectangle) is predicted as positive (in the disk):

   Sensitivity = (left half-disk)/(left rectangle).

2. Specificity is the probability that the diagnostic test shows negative, given that the subject does not have the disease:

   Specificity = (right rectangle − right half-disk)/(right rectangle).

In our example,

   sensitivity = 9/12 = 0.75.

Remark 1. We can also define the false positive rate as 1 − specificity.

Let's see an example in the context of testing for a disease. Assume a screening method for a rare disease has sensitivity .86 and specificity .88. This means that

   P(X = 1 | Y = 1) = .86  (1 denotes the positive class)
   P(X = 2 | Y = 2) = .88  (2 denotes the negative class)

where Y = 1 means the subject has the disease and X = 1 means the result of the test is positive. Furthermore, assume only 1% of the population is infected by this disease. A person takes the test and X = 1 (the test is positive). What is the probability that he has the disease, i.e., Y = 1?

Solution. Here we want to find P(Y = 1 | X = 1). By the Bayes rule,

   P(Y = 1 | X = 1) = P(X = 1 | Y = 1) P(Y = 1) / P(X = 1).

The numerator is readily computed as

   P(X = 1 | Y = 1) P(Y = 1) = .86 × .01 = .0086.

For the denominator we have

   P(X = 1) = P(X = 1 | Y = 1) P(Y = 1) + P(X = 1 | Y = 2) P(Y = 2) = .86 × .01 + .12 × .99 = .1274,

so the posterior is P(Y = 1 | X = 1) = .0086/.1274 ≈ .0675, which is still a very small number. Here, P(Y = 1) = .01 is called the (Bayesian) prior and P(Y = 1 | X = 1) is the posterior.
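The posterior computation is three lines of arithmetic; here is a minimal R sketch of it (variable names are ours):

# Bayes' rule for the rare-disease screening example
sens <- 0.86; spec <- 0.88; prior <- 0.01
num   <- sens * prior                    # P(X = 1, Y = 1)
denom <- num + (1 - spec) * (1 - prior)  # P(X = 1), by the law of total probability
num / denom                              # posterior P(Y = 1 | X = 1), ~ 0.0675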

Another way of measuring the accuracy of classifiers is by Precision and Recall:

   Precision = TP/(TP + FP),  Recall = TP/(TP + FN) = Sensitivity.

In our example, Recall = .75 and

   Precision = 9/(9 + 5) = 9/14 ≈ 0.64.    (5)

The function confusionMatrix() from library(caret) takes the prediction column and the true values and computes the contingency table, precision, recall, sensitivity, specificity and much more. You see that it generates a confidence interval for the accuracy; this is because the test dataset is a random sample.

> library(caret)
> confusionMatrix(CL$prdgender, CL$Gender)
Confusion Matrix and Statistics

          Reference
Prediction F M
         F 9 5
         M 3 3

               Accuracy : 0.6
                 95% CI : (0.3605, 0.8088)
    No Information Rate : 0.6
    P-Value [Acc > NIR] : 0.5956

                  Kappa : 0.1304
 Mcnemar's Test P-Value : 0.7237

            Sensitivity : 0.7500
            Specificity : 0.3750
         Pos Pred Value : 0.6429
         Neg Pred Value : 0.5000
             Prevalence : 0.6000
         Detection Rate : 0.4500
   Detection Prevalence : 0.7000
      Balanced Accuracy : 0.5625

       'Positive' Class : F

There is a trade-off between Precision and Recall in the sense that if we try to improve one of them in our model, the other will decrease.

3 Marginal Probabilities and Independence

Remember the result of the survey:

> table(df)

        Response
Gender    NO YES
  FEMALE 116 509
  MALE   104 398

We can normalize the table by dividing each cell by the total number of participants, i.e., 1127, to define a joint probability on the product space G × R as follows:

> prop.table(table(df))
        Response
Gender       NO    YES
  FEMALE 0.1029 0.4516
  MALE   0.0923 0.3531

This means that P gives probabilities to pairs of gender and response; for example, P(M, N) = .0923.

We can normalize the table in different ways. For example, if we divide the first row and the second row by their corresponding totals,

> (G.cond <- prop.table(table(df), 1))
        Response
Gender       NO    YES
  FEMALE 0.1856 0.8144
  MALE   0.2072 0.7928

we get conditional probabilities conditioned on Gender. The first row is conditioned on Gender = F and the second row on Gender = M, and we have

   P(N | F) = 0.1856,  P(Y | M) = 0.7928.

Similarly, we can condition on the response (the probabilities in each column add up to 1):

> (R.cond <- prop.table(table(df), 2))
        Response
Gender       NO    YES
  FEMALE 0.5273 0.5612
  MALE   0.4727 0.4388

   P(F | N) = 0.5273 = 1 − P(M | N).

The third way of normalizing is marginalizing. For example, the marginal probability of Gender is

> prop.table(table(df$Gender))
FEMALE   MALE 
0.5546 0.4454 

or, for Response,

> prop.table(table(df$Response))
    NO    YES 
0.1952 0.8048 

We can compute the marginal probabilities from the joint probabilities. For example,

   P(F) = P(F, Y) + P(F, N) = 0.4516 + 0.1029 = 0.5546,

because the events R = Y and R = N partition the sample space, i.e., every subject falls in one of these two sets and no subject falls in both events:

   P(Y ∪ N) = 1,  P(Y, N) = 0.    (6)

This is an example of the Law of Total Probability. To state this law formally, we need the definition of a partition.

Definition 2. A collection of events A_1, …, A_n forms a partition of the sample space if it satisfies the following two conditions:

1. They are mutually disjoint:

   P(A_i, A_j) = 0 for i ≠ j.    (7)

2. Together they cover the entire sample space:

   S = A_1 ∪ ⋯ ∪ A_n.    (8)

Theorem 1 (Law of Total Probability). Let A_1, …, A_n be a partition of a sample space. Then for any event B,

   P(B) = P(B, A_1) + ⋯ + P(B, A_n)  (sum of joint probabilities)    (9)
   P(B) = P(A_1) P(B | A_1) + ⋯ + P(A_n) P(B | A_n).    (10)

So the marginal probability of an event A is obtained by summing up all joint probabilities that have A as one of their inputs (margins). [Figure: A_1, A_2, A_3 form a partition of the sample space S; the event B is split into the pieces B_1 = B ∩ A_1, B_2 = B ∩ A_2, B_3 = B ∩ A_3.] Equation (10) is also referred to as the Law of Total Conditional Probability, which is readily derived from (9) using the identity P(B, A_n) = P(A_n) P(B | A_n); see the definition of conditional probability and (4). You can check that a conditional probability can be derived by dividing a joint probability by a marginal probability. For example, check that

   P(N | F) = P(N, F)/P(F) = 0.1029/0.5546 = 0.1856.

3.1 Independence

3.1.1 Independence of Two Events

Two events A and B are independent with respect to a probability P : S → [0, 1] if

   P(A, B) = P(A) P(B),

which is equivalent to

   P(A | B) = P(A).

We interpret the last one as: information about B doesn't change information about A.
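Reusing the afterlife matrix from the earlier sketch, both the marginalization and the independence check can be done numerically; this is an illustrative sketch, not code from the survey analysis itself:

joint <- prop.table(afterlife)       # joint probabilities
rowSums(joint)                       # P(F) ~ 0.5546, P(M) ~ 0.4454
colSums(joint)                       # P(YES) ~ 0.8048, P(NO) ~ 0.1952
joint["FEMALE", "YES"]                             # ~ 0.4516
rowSums(joint)["FEMALE"] * colSums(joint)["YES"]   # ~ 0.4463: close, but not equal

In the sample, the joint probability and the product of the marginals are close but not identical, which is exactly why Section 5.7 develops a formal test of independence.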

3.1.2 Independent Random Variables

Given two categorical variables X with I levels and Y with J levels, and joint probability P(X = i, Y = j), let's define the following notation:

   π_ij = P(X = i, Y = j),  i = 1, …, I and j = 1, …, J.    (11)

We also define notation for the marginals:

   π_i+ = Σ_j π_ij = Σ_j P(X = i, Y = j) = P(X = i)  for i = 1, …, I    (12)
   π_+j = Σ_i π_ij = Σ_i P(X = i, Y = j) = P(Y = j)  for j = 1, …, J.    (13)

X and Y are independent if, for any i ∈ {1, …, I} and j ∈ {1, …, J}, the joint probability of the events equals the product of the marginals:

   P(X = i, Y = j) = P(X = i) P(Y = j)    (14)

or, using the definition of conditional probability,

   P(X = i | Y = j) = P(X = i)  for all i and j,    (15)

i.e., the conditional probability equals the marginal probability!

Example 2. There are 100 blue, black and red balls in a jar. Each ball is either wooden or glass. The cross-classification is given below. Is color independent of type?

[Table: counts of the 100 balls, rows Glass and Wood, columns Blue, Black and Red; the numeric entries did not survive in this transcript.]

Solution. Dividing each count by 100 gives the joint probabilities, and summing rows and columns gives the marginals:

> (m1 <- apply(joint, 1, sum))
Glass  Wood 
> (m2 <- apply(joint, 2, sum))
 Blue Black   Red 

For every cell the product rule (14) holds, so color is independent of type.
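Since the jar's counts were lost above, here is a sketch with made-up counts that do satisfy independence exactly, to show what the check in Example 2 looks like in R (the numbers are illustrative only):

# Hypothetical jar: marginals (0.4, 0.6) for type and (0.2, 0.3, 0.5) for color
joint <- prop.table(matrix(c(8, 12, 20, 12, 18, 30), nrow = 2, byrow = TRUE,
                           dimnames = list(c("Glass", "Wood"),
                                           c("Blue", "Black", "Red"))))
m1 <- rowSums(joint); m2 <- colSums(joint)
all.equal(joint, outer(m1, m2), check.attributes = FALSE)   # TRUE: independent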

         YES  NO
FEMALE   509 116
MALE     398 104

Table 1: Cross-classifying contingency table

             NO    YES
FEMALE   0.1856 0.8144
MALE     0.2072 0.7928

Table 2: Conditional probabilities

4 Comparing Probabilities in 2 × 2 Contingency Tables

In our 2 × 2 contingency table, think of the levels of gender (Female, Male) as the explanatory random variable, or groups, that predict the response variable (Yes, No). Then we can think of p_1 = 0.81 and p_2 = 0.79 as the probabilities of success (Yes) in each group.

Remark 2. Here we tacitly assume that the numbers of males and females are fixed (non-random). Our analysis doesn't say what our best guess for the number of males and females would be if we repeated the sample. It only analyzes the range of the probabilities p_1 and p_2 in each group.

Let π_1 and π_2 denote the true rates of Response = YES in the female and male populations respectively. Then Remark 2 implies that the YES and NO responses in each group follow Bernoulli distributions with parameters π_1 and π_2, which are unknown to us.

Remark 3. If B is a Bernoulli random variable with parameter p, then mean(B) = p and Var(B) = p(1 − p).

We know the sample rates are p_1 = 0.81 and p_2 = 0.79. Can we compute a 95% confidence interval for π_1 − π_2? Coding the YES and NO responses as 1 and 0 respectively, the rate of success is p_1 = 509/625 ≈ 0.81 for the 625 female participants and p_2 = 398/502 ≈ 0.79 for the 502 male participants. Therefore, we can think of p_1 and p_2 as random sample means. By the Central Limit Theorem, we can assume they are sampled from two normal random variables, that is, p_1 and p_2 are distributed normally. More precisely,

   p_1 ∼ N(π_1, σ_1²),  p_2 ∼ N(π_2, σ_2²),    (16)

where σ_1² = π_1(1 − π_1)/625 and σ_2² = π_2(1 − π_2)/502. Since we don't have π_1 and π_2, we use p_1 and p_2 as approximations for π_1 and π_2, and write s_1² and s_2² instead of σ_1² and σ_2². Therefore we have

> p1 <- 0.81; p2 <- 0.79
> (s1 <- p1*(1-p1)/625)
[1] 0.00024624
> (s2 <- p2*(1-p2)/502)
[1] 0.00033048

Therefore

   p_1 ∼ N(0.81, 0.00024624),  p_2 ∼ N(0.79, 0.00033048).    (17)

Now let's discuss p_1 − p_2. But first a theorem! In the following theorem, pay attention that the variance is always the sum of the variances.

Theorem 2. If X_1 and X_2 are independent normal random variables with (mean, variance) parameters (m_1, σ_1²) and (m_2, σ_2²), then X_1 ± X_2 is normal with parameters (m_1 ± m_2, σ_1² + σ_2²).

Therefore, p_1 − p_2 is normally distributed with mean 0.81 − 0.79 = .02 and

   SE = √(s_1² + s_2²) = √(0.00024624 + 0.00033048) ≈ 0.024.    (18)

Therefore, the 95% confidence interval is

   [.02 − 1.96 × SE, .02 + 1.96 × SE] ≈ [−0.028, 0.068].
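The interval can also be checked against R's built-in two-proportion machinery; the sketch below uses the exact counts rather than the rounded 0.81 and 0.79, so its endpoints differ slightly from the figures above:

p1 <- 509/625; p2 <- 398/502
se <- sqrt(p1*(1 - p1)/625 + p2*(1 - p2)/502)
(p1 - p2) + c(-1.96, 1.96) * se                                 # ~ (-0.025, 0.068)
prop.test(c(509, 398), c(625, 502), correct = FALSE)$conf.int   # a very similar interval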

We can also perform hypothesis testing. Assume we want to check

   H_0 : π_1 = π_2  vs  H_1 : π_1 ≠ π_2.

Remark 4. π_1 = π_2 means P(R = Y | F) = P(R = Y | M). Cross-multiplying, P(Y, F) P(M) = P(Y, M) P(F). Since P(M) = 1 − P(F), this gives P(Y, F) − P(Y, F) P(F) = P(Y, M) P(F), which implies that P(Y, F) = [P(Y, M) + P(Y, F)] P(F) = P(Y) P(F). Therefore P(Y | F) = P(Y), which implies that P(N | F) = P(N). Similarly, we can check that P(Y | M) = P(Y), which implies that P(N | M) = P(N). Therefore, Response is independent of Gender.

Remember how we approximated σ_1² and σ_2² in (16) by s_1² and s_2². In hypothesis testing, under the null hypothesis π_1 = π_2, we can do a better job, thanks to Remark 4. The pooled variance is a common variance σ², closely related to the between variation in AOV, replacing both σ_1² and σ_2². If π_1 = π_2 then, as mentioned in Remark 4, Response is independent of Gender; therefore the two samples come from the same population, and σ_1² = σ_2² = σ². We estimate σ² by a weighted average of the sample variances:

   s_p² = [(n_1 − 1) s_1² + (n_2 − 1) s_2²]/(n_1 + n_2 − 2) ≈ 0.000284,

so s_p ≈ 0.0168; compare with (18). If π_1 = π_2, then p_1 − p_2 ∼ N(0, s_p²). The number we sampled is p_1 − p_2 = .02. What is the chance that N(0, s_p²) generates a number at distance .02 or more from the center 0?

> 2*pnorm(q = -.02, mean = 0, sd = 0.0168, lower.tail = TRUE)
[1] 0.234

Therefore, we cannot reject the null hypothesis.
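A more common variant of this pooled test pools the success counts rather than the variances; this is what prop.test does when the continuity correction is turned off. A sketch for comparison:

x <- c(509, 398); n <- c(625, 502)
p.pool <- sum(x)/sum(n)                       # pooled success rate under H0
z <- (x[1]/n[1] - x[2]/n[2]) /
     sqrt(p.pool * (1 - p.pool) * (1/n[1] + 1/n[2]))
2 * pnorm(-abs(z))                            # two-sided p-value, ~ 0.36

Either way the p-value is far above .05, so the conclusion is the same.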

5 Odds and Odds Ratio

If π is the rate of success in a binomial trial, then its corresponding odds are defined to be

   odds = π/(1 − π).    (19)

If odds = 4, then a success is 4 times as likely as a failure: we expect to see, on average, 4 successes for each failure. We can of course retrieve π from its odds via π = odds/(odds + 1).

Every 2 × 2 contingency table induces two rates of success, π_1 and π_2, corresponding to its rows. Let odds_1 and odds_2 be the odds corresponding to π_1 and π_2. Dividing odds_1 by odds_2, we find another measure of association between the rows. This measure, denoted by θ, is called the odds ratio and is defined by

   θ = odds_1/odds_2 = [π_1/(1 − π_1)] / [π_2/(1 − π_2)].    (20)

Odds ratios are positive numbers in the interval θ ∈ (0, ∞). θ = 4 means the odds of the group in the first row are 4 times the odds of the group in the second row; θ = 1/4 means the opposite, i.e., the odds of the group in the second row are 4 times the odds of the group in the first row; θ = 1 means the odds are equal, which implies that π_1 = π_2. In general, θ > 1 implies π_1 > π_2, θ = 1 implies π_1 = π_2, and θ < 1 implies π_1 < π_2. Furthermore, for any positive number α > 0, θ = α and θ = 1/α convey opposite implications about the odds of the two groups.

As always in statistics, we only have the sample odds ratio, defined by

   θ̂ = [p_1/(1 − p_1)] / [p_2/(1 − p_2)].

Consider two populations with equal odds. Then the sampling odds ratio will be around 1, with its left tail in (0, 1) and its right tail in (1, ∞); therefore, the sampling distribution of the odds ratio is highly skewed. But if we consider log θ instead of θ, we get nicer and more intuitive properties. For example,

1. log θ = 0 (i.e., θ = 1) implies π_1 = π_2;
2. log θ = 2 and log θ = −2 are symmetric around 0 and convey opposite statements about π_1 and π_2.

The sample log odds ratio, log θ̂, has a less skewed sampling distribution that is bell-shaped, with standard deviation given by

   SE = √(1/n_11 + 1/n_12 + 1/n_21 + 1/n_22),    (21)

where the n_ij are the counts in the contingency table.

Example 3. For our contingency table of the afterlife belief, compute log θ̂ and a 95% confidence interval for log θ.

Solution. The sample log θ̂ and standard deviation are

> odds.f <- 0.8144/(1-0.8144)
> odds.m <- 0.7928/(1-0.7928)
> (p <- log(odds.f/odds.m))
[1] 0.137
> (SE <- sqrt((1/509)+(1/116)+(1/398)+(1/104)))
[1] 0.1507

Then the lower and upper limits of the 95% CI are

> (lower <- p-1.96*SE)
[1] -0.1584
> (upper <- p+1.96*SE)
[1] 0.4324

Since zero is included in the interval, log θ = 0 is a possibility, so at the 95% confidence level we cannot rule out π_1 = π_2. By exponentiating, we find that the 95% CI for θ is

   [exp(−0.1584), exp(0.4324)] ≈ [0.8535, 1.5410].

5.1 Contingency Tables and Chi-Square Test

A 2000 General Social Survey cross-classifies 2757 subjects based on gender and political party as below:

       Democrat Independent Republican
Female      762         327        468
Male        484         239        477

This table defines a sample joint probability p = {p_11, p_12, p_13, p_21, p_22, p_23}, that is

       Democrat Independent Republican
Female   0.2764      0.1186     0.1698
Male     0.1756      0.0867     0.1730

Of course p is random, as it is defined by a random sample. Is there enough evidence to reject H_0, defined by

   H_0 : π = {0.25, 0.1, 0.25, 0.15, 0.1, 0.15}?

Solution. The expected count for each cell, μ = μ_ij, based on π = π_ij, is obtained by μ_ij = 2757 π_ij. So we have

       Democrat Independent Republican
Female   689.25      275.70     689.25
Male     413.55      275.70     413.55

There are 6 residuals, the differences between the expected (fitted) values and the sample (actual) values. The residual squares are

> sample <- c(762,327,468,484,239,477)
> expected <- c(689.25,275.70,689.25,413.55,275.70,413.55)
> res.sq <- c(sample-expected)^2

       Democrat Independent Republican
Female  5292.56     2631.69   48951.56
Male    4963.20     1346.89    4025.90

Bigger residuals are stronger evidence against H_0.

5.2 Chi-squared Distribution

The chi-squared distribution, also denoted the χ² distribution, with k degrees of freedom is the distribution of the sum of squares of k independent standard normal random variables. Think of the residuals in a contingency table: they are approximately normal, and after dividing by their standard deviations they become standard normal.

Definition 3 (Wikipedia). If Z_1, …, Z_k are independent, standard normal random variables, then the sum of their squares,

   Q = Σ_{i=1}^{k} Z_i²,    (22)

is distributed according to the chi-squared distribution with k degrees of freedom. This is usually denoted as Q ∼ χ²_k. Furthermore, if X ∼ χ²_k, then EX = k and Var X = 2k.
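The mean and variance claims in Definition 3 are easy to confirm by simulation; a minimal sketch:

# Draw 10000 values of Q = Z1^2 + ... + Z5^2 and check its mean and variance
set.seed(1)
z <- matrix(rnorm(10000 * 5), ncol = 5)   # 10000 sets of 5 iid N(0,1) draws
q <- rowSums(z^2)                         # 10000 draws from chi^2 with 5 df
mean(q); var(q)                           # close to 5 and 10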

[Figure: χ² densities for several degrees of freedom.] The graph of the densities shows how the χ² density becomes closer to a normal density as the degrees of freedom increase. In the discussion that comes next, we will talk about the mean and variance of the frequencies of each cell, not to be confused with the mean and variance of Q.

5.3 Simulating the Contingency Table

Assume we know the population probabilities of each cell: π = {π_11, π_12, π_13, π_21, π_22, π_23}. Then the counts of these 6 cells follow a multinomial distribution. For example, with 1000 people we might get

> set.seed(1001)
> pi <- c(.1,.3,.2,.1,.2,.1)
> r <- 14
> N <- 1000
> (sample <- rmultinom(n = r, size = N, prob = pi))

[Output: a 6 × 14 matrix of counts; the values did not survive in this transcript.]

Each column is a random sample for the 6 cells of the contingency table, and each row contains the 14 samples of one of the 6 cells. Check that each cell looks normal with mean = Nπ and sd = √(Nπ(1 − π)):

> (mean <- 1000*pi[1])
[1] 100
> (sd <- sqrt(1000*pi[1]*(1-pi[1])))
[1] 9.486833
> hist(sample[1,])

[Figure: histogram of sample[1, ].]
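A quick numerical check of this claim on the simulated tables (a sketch; with only r = 14 replicates the agreement is rough):

mean(sample[1, ]); sd(sample[1, ])   # should come out near 100 and 9.49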

So each cell cell_ij is binomial with parameter π_ij, with mean equal to Nπ_ij and variance Nπ_ij(1 − π_ij); see Theorem 3. Therefore, by the CLT,

   (O − Nπ_ij)/√(Nπ_ij(1 − π_ij)) ≈ N(0, 1).

5.4 A Two-Cell Model

Let's see what this yields when there are only two cells: one row and 2 columns, i.e., j = 1 and i = 1, 2. Let π_1 and π_2 denote the probabilities of cell 1 and cell 2, with π_1 + π_2 = 1. Furthermore, let

1. E_i = Nπ_i denote the mean (expected value) of cell i;
2. O_i denote the observation in cell i, so that O_1 + O_2 = N;
3. Q_i = (O_i − Nπ_i)/√(Nπ_i(1 − π_i)).

By the CLT, Q_1 is sampled from an approximately normal distribution, see Theorem 3. Writing π = π_1, the square Q_1² is

   Q_1² = (O_1 − E_1)²/(Nπ(1 − π))
        = (O_1 − Nπ)²/(Nπ) + (O_2 − N(1 − π))²/(N(1 − π))  (after some algebraic manipulation)
        = (O_1 − E_1)²/E_1 + (O_2 − E_2)²/E_2.

Therefore, in a two-cell model, if we compute (O_i − E_i)²/E_i for each cell and add them up, the result is a χ²_1 random variable.

Theorem 3. If X is a binomial random variable with parameters N and π, where N is the number of trials and π the rate of success, then

   EX = Nπ,  sd X = √(Nπ(1 − π)).    (23)

5.5 Six-Cell Model

The same holds when there are 6 cells. We want to test the null hypothesis, where H_0 and H_1 are given by

   H_0 : π = (π_1, …, π_6),  H_1 : π is not as given by H_0,    (24)

and we have observations O_1, …, O_6. Furthermore, the total number of observations N = O_1 + ⋯ + O_6 can be computed.

1. Compute E_i = Nπ_i, for i = 1, …, 6.
2. Compute (O_i − E_i)²/E_i for i = 1, …, 6.
3. Compute the test statistic Q = Σ_{i=1}^{6} (O_i − E_i)²/E_i.

Q is a number which, under H_0, is sampled from χ²_5: Q ∼ χ²_5. Check the table to see the chance (the p-value) that one would sample Q or bigger from χ²_5. If p_val < .05, you reject H_0.

5.6 Example Continued: Goodness of Fit

For our cross-classification of gender and political party, we computed all the summands. Adding them up, we get

> (chi.sq <- sum(res.sq/expected))
[1] 114.8674

This number is sampled from χ²_5. What is the chance that χ²_5 generates 114.87 or bigger?

> (p_val <- pchisq(q = chi.sq, df = 5, lower.tail = FALSE))
[1] 3.8e-23

And we reject H_0. This is an example of checking the goodness of fit using the chi-squared test. We are given observations, and our model is basically defined by π_1, …, π_6, or equivalently π_ij, i = 1, …, 3, j = 1, 2. In this case we concluded that the model defined by H_0 is not appropriate.

Example 4. Assume we are given 20 numbers, and we want to see whether it is acceptable to assume they are sampled from a normal distribution. [The 20 values are not reproduced in this transcript; the smallest is 88 and the largest is 120.]

> hist(t)

[Figure: histogram of t.]

Solution. If the numbers are from a normal distribution, then the mean and standard deviation would be

> (m <- mean(t))
[1] 102
> (sd <- sd(t))
[1] 10

The range starts from 88 and ends with the biggest number, 120. Let's make 3 cells: Cell 2 for all observations within one standard deviation of the mean, i.e., all observations in the interval Cell 2 = [102 − 10, 102 + 10] = [92, 112]; Cell 3 = (112, ∞); and Cell 1 = (−∞, 92]. If the numbers are from N(102, 10), then we can find the probability of each cell. From the familiar normal-density picture (about 68% of the mass lies within one standard deviation of the mean, and about 16% in each tail) we know that

> pi <- c(.16,.68,.16)
> o1 <- sum(t<(m-sd)); o3 <- sum(t>(m+sd)); o2 <- (length(t)-(o1+o3))
> (ob <- c(o1,o2,o3))

> (E <- pi*20)
[1]  3.2 13.6  3.2
> (res <- ((ob-E)^2)/E)
> (chi.sq <- sum(res))
[1] 0.588

There are 3 cells, therefore there are 2 degrees of freedom. Is χ²_2 = .588 too big?

> (p_val <- pchisq(q = chi.sq, df = 2, lower.tail = FALSE))
[1] 0.7453

No, we cannot reject the possibility that the numbers are sampled from N(102, 10).

5.7 Test of Independence

We dealt with independence in Example 2. What is different here? In Example 2 we had access to the entire population (the jar of balls) and could compute π_ij for each cell. Here we have a sample, and we need a more powerful theory to infer about the population probabilities. We cannot apply the definition of independence directly to the probabilities p derived from the contingency table, as they are at best estimates of π_ij and fluctuate from sample to sample.

5.7.1 Structure of H_0

In the χ² test of independence, H_0 is different from the H_0 for goodness of fit in (24). Here, instead of being given the joint probabilities π_ij, we are given the observations. We can then add up the observations in columns and rows to compute the marginals π_i+ and π_+j; see (12).

5.7.2 Degrees of Freedom

Furthermore, when testing goodness of fit, the only constraint on the joint probabilities π_ij is that

   Σ_ij π_ij = 1,    (25)

so there are IJ − 1 degrees of freedom. When testing independence, the marginals are computed from the observations. Using the marginals, we compute the joint probabilities under the independence condition, see (14). Therefore, instead of the single constraint Σ_ij π_ij = 1, the joint probabilities in each row and each column should add up to the first and second marginals respectively. Therefore, the degrees of freedom are (I − 1) × (J − 1).

5.7.3 How It Works

Assume there are I rows and J columns, so i ranges over 1, …, I and j over 1, …, J. We are given observations O_ij for all i, j. Therefore, we can compute the sample marginals p_i+ and p_+j. Use the sample marginals as approximations for π_i+ and π_+j. Use π_i+ and π_+j to compute the joint probabilities π_ij under the independence condition. Finally, test the hypothesis that the observations are consistent with these joint probabilities:

   H_0 : O_ij is sampled from π_ij for all i, j
   H_1 : O_ij is not sampled from π_ij for at least one pair i, j

Solution. We discuss the procedure step by step below. Remember that we are only given the observations O_ij.

1. Let O = Σ_ij O_ij denote the total number of observations.
2. Add the observations in each row and divide by O to obtain the row marginals π_1+, …, π_I+.
3. Add the observations in each column and divide by O to obtain the column marginals π_+1, …, π_+J.
4. Under independence, we can compute the joint probabilities π_ij = π_i+ π_+j.
5. Compute all the expected observations E_ij = π_ij O.
6. Compute (O_ij − E_ij)²/E_ij for all i, j.
7. Compute the test statistic

   Q = Σ_ij (O_ij − E_ij)²/E_ij.    (26)

8. Compute the p-value p_val, the right-tail probability beyond Q of the χ² distribution with (I − 1) × (J − 1) degrees of freedom.

Let me sample from the jar in Example 2. The sample is with replacement, so the size of the sample could be bigger than 100. Test the hypothesis

   H_0 : the color of a ball is independent of its type.

      Blue Black Red
Glass    2     6   8
Wood     7    14  13

To be able to use R, we store these numbers in a matrix:

> (a <- matrix(data = c(2,6,8,7,14,13), nrow = 2, byrow = TRUE,
+    dimnames = list(c("Glass","Wood"),c("Blue","Black","Red"))))
      Blue Black Red
Glass    2     6   8
Wood     7    14  13

1. The total number of observations is O = 2 + 6 + 8 + 7 + 14 + 13 = 50.
2. The first marginal is m.1 = [(2+6+8)/50, (7+14+13)/50]:

> (m.1 <- apply(X = a, MARGIN = 1, FUN = sum)/50)
Glass  Wood 
 0.32  0.68 

3. The second marginal:

> (m.2 <- apply(X = a, MARGIN = 2, FUN = sum)/50)
 Blue Black   Red 
 0.18  0.40  0.42 

4. The joint probabilities:

> (jp <- m.1 %o% m.2)
        Blue Black    Red
Glass 0.0576 0.128 0.1344
Wood  0.1224 0.272 0.2856

5. The expected observations:

> (E <- jp*50)
      Blue Black   Red
Glass 2.88   6.4  6.72
Wood  6.12  13.6 14.28

6. Compute the residuals squared, divided by the expected values:

> (R <- ((E-a)^2)/E)
        Blue    Black     Red
Glass 0.2689 0.025000 0.24381
Wood  0.1265 0.011765 0.11473

7. Compute Q:

> (chi.sq <- sum(R))
[1] 0.7907

8. The degrees of freedom are (2 − 1) × (3 − 1) = 2.
9. Compute the p-value:

> (p_val <- pchisq(q = chi.sq, df = 2, lower.tail = FALSE))
[1] 0.6734

We cannot reject independence. Or you can simply feed your data to the following command in R:

> chisq.test(a)

	Pearson's Chi-squared test

data:  a
X-squared = 0.79073, df = 2, p-value = 0.6734

6 ROC Curve and AUC

Assume an endemic disease has affected 10% of a population. We designed a classifier that generates a probability p of being in the positive class (diseased). We gather two groups of 500 healthy and 50 diseased patients and look at the distribution of scores that the classifier produces for each group. Assume we get the following results:

> H.score <- rnorm(500,.3,.15)
> S.score <- rnorm(50,.7,.1)

Let's look at the overlapping distributions of the two groups of scores in a plot. The vertical line is a threshold which encodes the prediction rule: scores on the right (bigger than the threshold) are predicted sick and scores on the left are predicted healthy. Then

1. Red on the left of the vertical line means: TRUE NEGATIVE

2. Red on the right of the vertical line means: FALSE POSITIVE
3. Green on the right of the vertical line means: TRUE POSITIVE
4. Green on the left of the vertical line means: FALSE NEGATIVE

We can place the vertical line at any x ∈ (0, 1) and compute the corresponding true positive rate and false positive rate. The resulting plot in false positive–true positive space is a curve. Assume from the data we know the positive class and the negative class (this is not based on prediction, but on the labels of the data). For example, H.score has 500 members; therefore we know that TN + FP = 500. Similarly, there are 50 sick people, which means TP + FN = 50. Before we proceed further to compute the rates, we will redefine S.score and H.score; first, the plot:

> library(ggplot2)
> dat <- data.frame(dens = c(H.score,S.score), lines = rep(c("Healthy","Sick"), c(500,50)))
> ggplot(dat, aes(dens, fill = lines)) + geom_histogram(position = "dodge") +
+   geom_vline(xintercept = .6)

[Figure: overlapping histograms of the Healthy and Sick scores, with a vertical threshold line at 0.6.]

> H.score <- rnorm(500,.4,.2)
> S.score <- rnorm(50,.6,.2)

Therefore, for a given threshold, say .6, we have

> threshold <- .6
> lp <- length(pclass <- S.score)
> ln <- length(nclass <- H.score)
> c(tpr = sum(pclass>threshold)/lp, fpr = sum(nclass>threshold)/ln)

where tpr and fpr stand for the true positive rate and the false positive rate respectively. We can compute tpr and fpr for different thresholds:

> ts <- seq(from = .01, to = .99, by = .01)
> K <- unname(unlist(lapply(X = ts, FUN = function(threshold)
+   c(tpr = sum(pclass>threshold)/length(pclass),
+     fpr = sum(nclass>threshold)/length(nclass)))))

Then we can separate tpr from fpr:

> tpr <- K[seq(from = 1, to = 197, by = 2)]
> fpr <- K[seq(from = 2, to = 198, by = 2)]

Then we can plot it:

> plot(fpr, tpr, type = 'l')

[Figure: the ROC curve, tpr plotted against fpr.]

The curve produced this way is called the ROC curve, and the area under the curve equals the probability that

the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example [From ...]. This fact allows us to compute the area under the curve by sampling and counting:

> p <- replicate(50000, sample(pclass, size=1) > sample(nclass, size=1))
> mean(p)

For these score distributions, the estimate comes out around 0.76. Let's repeat the process with a modified classifier which gives different (better) scores:

> H.score <- rnorm(500,.4,.15) -> nclass
> S.score <- rnorm(50,.6,.12) -> pclass
> ts <- seq(from = .01, to = .99, by = .01)
> K <- unname(unlist(lapply(X = ts, FUN = function(threshold)
+   c(tpr = sum(pclass>threshold)/length(pclass),
+     fpr = sum(nclass>threshold)/length(nclass)))))

Then we can separate tpr from fpr:

> tpr <- K[seq(from = 1, to = 197, by = 2)]
> fpr <- K[seq(from = 2, to = 198, by = 2)]

Then we can plot it:

> plot(fpr, tpr, type = 'l')

[Figure: the ROC curve of the modified classifier.]

> p <- replicate(50000, sample(pclass, size=1) > sample(nclass, size=1))
> mean(p)

This time the estimate is higher, around 0.85. You can play with the distributions of pclass and nclass and repeat the procedure above to see that, as the scores become more concentrated and more separated, the area under the curve increases.
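For two normal score distributions, the AUC also has a closed form, AUC = P(S > H) = Φ((μ_S − μ_H)/√(σ_S² + σ_H²)), which gives a handy sanity check on the sampling estimates above (a sketch):

# Closed-form AUC for independent normal scores: P(S > H)
pnorm((.6 - .4)/sqrt(.2^2 + .2^2))     # first classifier:  ~ 0.76
pnorm((.6 - .4)/sqrt(.15^2 + .12^2))   # second classifier: ~ 0.85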

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios ST3241 Categorical Data Analysis I Two-way Contingency Tables 2 2 Tables, Relative Risks and Odds Ratios 1 What Is A Contingency Table (p.16) Suppose X and Y are two categorical variables X has I categories

More information

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution

Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution Introduction to Statistical Data Analysis Lecture 7: The Chi-Square Distribution James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis

More information

Probability and Statistics. Terms and concepts

Probability and Statistics. Terms and concepts Probability and Statistics Joyeeta Dutta Moscato June 30, 2014 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution

More information

Performance evaluation of binary classifiers

Performance evaluation of binary classifiers Performance evaluation of binary classifiers Kevin P. Murphy Last updated October 10, 2007 1 ROC curves We frequently design systems to detect events of interest, such as diseases in patients, faces in

More information

n y π y (1 π) n y +ylogπ +(n y)log(1 π).

n y π y (1 π) n y +ylogπ +(n y)log(1 π). Tests for a binomial probability π Let Y bin(n,π). The likelihood is L(π) = n y π y (1 π) n y and the log-likelihood is L(π) = log n y +ylogπ +(n y)log(1 π). So L (π) = y π n y 1 π. 1 Solving for π gives

More information

Probability and Statistics. Joyeeta Dutta-Moscato June 29, 2015

Probability and Statistics. Joyeeta Dutta-Moscato June 29, 2015 Probability and Statistics Joyeeta Dutta-Moscato June 29, 2015 Terms and concepts Sample vs population Central tendency: Mean, median, mode Variance, standard deviation Normal distribution Cumulative distribution

More information

Performance Evaluation and Hypothesis Testing

Performance Evaluation and Hypothesis Testing Performance Evaluation and Hypothesis Testing 1 Motivation Evaluating the performance of learning systems is important because: Learning systems are usually designed to predict the class of future unlabeled

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Statistics 3858 : Contingency Tables

Statistics 3858 : Contingency Tables Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

13.1 Categorical Data and the Multinomial Experiment

13.1 Categorical Data and the Multinomial Experiment Chapter 13 Categorical Data Analysis 13.1 Categorical Data and the Multinomial Experiment Recall Variable: (numerical) variable (i.e. # of students, temperature, height,). (non-numerical, categorical)

More information

Introduction to Supervised Learning. Performance Evaluation

Introduction to Supervised Learning. Performance Evaluation Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

STAT 705: Analysis of Contingency Tables

STAT 705: Analysis of Contingency Tables STAT 705: Analysis of Contingency Tables Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Analysis of Contingency Tables 1 / 45 Outline of Part I: models and parameters Basic

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

STAC51: Categorical data Analysis

STAC51: Categorical data Analysis STAC51: Categorical data Analysis Mahinda Samarakoon January 26, 2016 Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 32 Table of contents Contingency Tables 1 Contingency Tables Mahinda Samarakoon

More information

Chapter 26: Comparing Counts (Chi Square)

Chapter 26: Comparing Counts (Chi Square) Chapter 6: Comparing Counts (Chi Square) We ve seen that you can turn a qualitative variable into a quantitative one (by counting the number of successes and failures), but that s a compromise it forces

More information

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace

A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I. kevin small & byron wallace A.I. in health informatics lecture 2 clinical reasoning & probabilistic inference, I kevin small & byron wallace today a review of probability random variables, maximum likelihood, etc. crucial for clinical

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

How do we compare the relative performance among competing models?

How do we compare the relative performance among competing models? How do we compare the relative performance among competing models? 1 Comparing Data Mining Methods Frequent problem: we want to know which of the two learning techniques is better How to reliably say Model

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence

ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence ST3241 Categorical Data Analysis I Two-way Contingency Tables Odds Ratio and Tests of Independence 1 Inference For Odds Ratio (p. 24) For small to moderate sample size, the distribution of sample odds

More information

Chapter 2: Describing Contingency Tables - I

Chapter 2: Describing Contingency Tables - I : Describing Contingency Tables - I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu]

More information

STAT Chapter 13: Categorical Data. Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure).

STAT Chapter 13: Categorical Data. Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure). STAT 515 -- Chapter 13: Categorical Data Recall we have studied binomial data, in which each trial falls into one of 2 categories (success/failure). Many studies allow for more than 2 categories. Example

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

MA : Introductory Probability

MA : Introductory Probability MA 320-001: Introductory Probability David Murrugarra Department of Mathematics, University of Kentucky http://www.math.uky.edu/~dmu228/ma320/ Spring 2017 David Murrugarra (University of Kentucky) MA 320:

More information

The Chi-Square Distributions

The Chi-Square Distributions MATH 183 The Chi-Square Distributions Dr. Neal, WKU The chi-square distributions can be used in statistics to analyze the standard deviation σ of a normally distributed measurement and to test the goodness

More information

The Multinomial Model

The Multinomial Model The Multinomial Model STA 312: Fall 2012 Contents 1 Multinomial Coefficients 1 2 Multinomial Distribution 2 3 Estimation 4 4 Hypothesis tests 8 5 Power 17 1 Multinomial Coefficients Multinomial coefficient

More information

Quantitative Analysis and Empirical Methods

Quantitative Analysis and Empirical Methods Hypothesis testing Sciences Po, Paris, CEE / LIEPP Introduction Hypotheses Procedure of hypothesis testing Two-tailed and one-tailed tests Statistical tests with categorical variables A hypothesis A testable

More information

Least Squares Classification

Least Squares Classification Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

Q1 (12 points): Chap 4 Exercise 3 (a) to (f) (2 points each)

Q1 (12 points): Chap 4 Exercise 3 (a) to (f) (2 points each) Q1 (1 points): Chap 4 Exercise 3 (a) to (f) ( points each) Given a table Table 1 Dataset for Exercise 3 Instance a 1 a a 3 Target Class 1 T T 1.0 + T T 6.0 + 3 T F 5.0-4 F F 4.0 + 5 F T 7.0-6 F T 3.0-7

More information

Smart Home Health Analytics Information Systems University of Maryland Baltimore County

Smart Home Health Analytics Information Systems University of Maryland Baltimore County Smart Home Health Analytics Information Systems University of Maryland Baltimore County 1 IEEE Expert, October 1996 2 Given sample S from all possible examples D Learner L learns hypothesis h based on

More information

STAT:5100 (22S:193) Statistical Inference I

STAT:5100 (22S:193) Statistical Inference I STAT:5100 (22S:193) Statistical Inference I Week 3 Luke Tierney University of Iowa Fall 2015 Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1 Recap Matching problem Generalized

More information

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval Epidemiology 9509 Wonders of Biostatistics Chapter 11 (continued) - probability in a single population John Koval Department of Epidemiology and Biostatistics University of Western Ontario What is being

More information

Performance Evaluation

Performance Evaluation Performance Evaluation Confusion Matrix: Detected Positive Negative Actual Positive A: True Positive B: False Negative Negative C: False Positive D: True Negative Recall or Sensitivity or True Positive

More information

Unit 9: Inferences for Proportions and Count Data

Unit 9: Inferences for Proportions and Count Data Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 1/15/008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

15: CHI SQUARED TESTS

15: CHI SQUARED TESTS 15: CHI SQUARED ESS MULIPLE CHOICE QUESIONS In the following multiple choice questions, please circle the correct answer. 1. Which statistical technique is appropriate when we describe a single population

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

Binary Logistic Regression

Binary Logistic Regression The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b

More information

Categorical Variables and Contingency Tables: Description and Inference

Categorical Variables and Contingency Tables: Description and Inference Categorical Variables and Contingency Tables: Description and Inference STAT 526 Professor Olga Vitek March 3, 2011 Reading: Agresti Ch. 1, 2 and 3 Faraway Ch. 4 3 Univariate Binomial and Multinomial Measurements

More information

Evaluation & Credibility Issues

Evaluation & Credibility Issues Evaluation & Credibility Issues What measure should we use? accuracy might not be enough. How reliable are the predicted results? How much should we believe in what was learned? Error on the training data

More information

Categorical Data Analysis Chapter 3

Categorical Data Analysis Chapter 3 Categorical Data Analysis Chapter 3 The actual coverage probability is usually a bit higher than the nominal level. Confidence intervals for association parameteres Consider the odds ratio in the 2x2 table,

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

PubH 5450 Biostatistics I Prof. Carlin. Lecture 13

PubH 5450 Biostatistics I Prof. Carlin. Lecture 13 PubH 5450 Biostatistics I Prof. Carlin Lecture 13 Outline Outline Sample Size Counts, Rates and Proportions Part I Sample Size Type I Error and Power Type I error rate: probability of rejecting the null

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

 M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2 Notation and Equations for Final Exam Symbol Definition X The variable we measure in a scientific study n The size of the sample N The size of the population M The mean of the sample µ The mean of the

More information

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score Interpret Standard Deviation Outlier Rule Linear Transformations Describe the Distribution OR Compare the Distributions SOCS Using Normalcdf and Invnorm (Calculator Tips) Interpret a z score What is an

More information

Lecture 8: Summary Measures

Lecture 8: Summary Measures Lecture 8: Summary Measures Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 8:

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

Probability Theory and Applications

Probability Theory and Applications Probability Theory and Applications Videos of the topics covered in this manual are available at the following links: Lesson 4 Probability I http://faculty.citadel.edu/silver/ba205/online course/lesson

More information

Lecture 5: ANOVA and Correlation

Lecture 5: ANOVA and Correlation Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62 Comparing Multiple Groups Continous data: comparing means Analysis of variance Binary data: comparing proportions

More information

Chapter 10. Discrete Data Analysis

Chapter 10. Discrete Data Analysis Chapter 1. Discrete Data Analysis 1.1 Inferences on a Population Proportion 1. Comparing Two Population Proportions 1.3 Goodness of Fit Tests for One-Way Contingency Tables 1.4 Testing for Independence

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Inference for Binomial Parameters

Inference for Binomial Parameters Inference for Binomial Parameters Dipankar Bandyopadhyay, Ph.D. Department of Biostatistics, Virginia Commonwealth University D. Bandyopadhyay (VCU) BIOS 625: Categorical Data & GLM 1 / 58 Inference for

More information

Two-sample Categorical data: Testing

Two-sample Categorical data: Testing Two-sample Categorical data: Testing Patrick Breheny April 1 Patrick Breheny Introduction to Biostatistics (171:161) 1/28 Separate vs. paired samples Despite the fact that paired samples usually offer

More information

16.400/453J Human Factors Engineering. Design of Experiments II

16.400/453J Human Factors Engineering. Design of Experiments II J Human Factors Engineering Design of Experiments II Review Experiment Design and Descriptive Statistics Research question, independent and dependent variables, histograms, box plots, etc. Inferential

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

An introduction to biostatistics: part 1

An introduction to biostatistics: part 1 An introduction to biostatistics: part 1 Cavan Reilly September 6, 2017 Table of contents Introduction to data analysis Uncertainty Probability Conditional probability Random variables Discrete random

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Performance Evaluation

Performance Evaluation Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

TA: Sheng Zhgang (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan (W 1:20) / 346 (Th 12:05) FINAL EXAM

TA: Sheng Zhgang (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan (W 1:20) / 346 (Th 12:05) FINAL EXAM STAT 301, Fall 2011 Name Lec 4: Ismor Fischer Discussion Section: Please circle one! TA: Sheng Zhgang... 341 (Th 1:20) / 342 (W 1:20) / 343 (W 2:25) / 344 (W 12:05) Haoyang Fan... 345 (W 1:20) / 346 (Th

More information

AP Statistics Cumulative AP Exam Study Guide

AP Statistics Cumulative AP Exam Study Guide AP Statistics Cumulative AP Eam Study Guide Chapters & 3 - Graphs Statistics the science of collecting, analyzing, and drawing conclusions from data. Descriptive methods of organizing and summarizing statistics

More information

STATISTICS 141 Final Review

STATISTICS 141 Final Review STATISTICS 141 Final Review Bin Zou bzou@ualberta.ca Department of Mathematical & Statistical Sciences University of Alberta Winter 2015 Bin Zou (bzou@ualberta.ca) STAT 141 Final Review Winter 2015 1 /

More information

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1

CptS 570 Machine Learning School of EECS Washington State University. CptS Machine Learning 1 CptS 570 Machine Learning School of EECS Washington State University CptS 570 - Machine Learning 1 IEEE Expert, October 1996 CptS 570 - Machine Learning 2 Given sample S from all possible examples D Learner

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Goodness of Fit Tests

Goodness of Fit Tests Goodness of Fit Tests Marc H. Mehlman marcmehlman@yahoo.com University of New Haven (University of New Haven) Goodness of Fit Tests 1 / 38 Table of Contents 1 Goodness of Fit Chi Squared Test 2 Tests of

More information

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Lecture 41 Sections Mon, Apr 7, 2008

Lecture 41 Sections Mon, Apr 7, 2008 Lecture 41 Sections 14.1-14.3 Hampden-Sydney College Mon, Apr 7, 2008 Outline 1 2 3 4 5 one-proportion test that we just studied allows us to test a hypothesis concerning one proportion, or two categories,

More information

Chapter 9 Inferences from Two Samples

Chapter 9 Inferences from Two Samples Chapter 9 Inferences from Two Samples 9-1 Review and Preview 9-2 Two Proportions 9-3 Two Means: Independent Samples 9-4 Two Dependent Samples (Matched Pairs) 9-5 Two Variances or Standard Deviations Review

More information

Unit 9: Inferences for Proportions and Count Data

Unit 9: Inferences for Proportions and Count Data Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 12/15/2008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)

More information

Chapter 10: Chi-Square and F Distributions

Chapter 10: Chi-Square and F Distributions Chapter 10: Chi-Square and F Distributions Chapter Notes 1 Chi-Square: Tests of Independence 2 4 & of Homogeneity 2 Chi-Square: Goodness of Fit 5 6 3 Testing & Estimating a Single Variance 7 10 or Standard

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Statistical methods for comparing multiple groups. Lecture 7: ANOVA. ANOVA: Definition. ANOVA: Concepts

Statistical methods for comparing multiple groups. Lecture 7: ANOVA. ANOVA: Definition. ANOVA: Concepts Statistical methods for comparing multiple groups Lecture 7: ANOVA Sandy Eckel seckel@jhsph.edu 30 April 2008 Continuous data: comparing multiple means Analysis of variance Binary data: comparing multiple

More information

Lecture 7: Hypothesis Testing and ANOVA

Lecture 7: Hypothesis Testing and ANOVA Lecture 7: Hypothesis Testing and ANOVA Goals Overview of key elements of hypothesis testing Review of common one and two sample tests Introduction to ANOVA Hypothesis Testing The intent of hypothesis

More information

Statistical methods in recognition. Why is classification a problem?

Statistical methods in recognition. Why is classification a problem? Statistical methods in recognition Basic steps in classifier design collect training images choose a classification model estimate parameters of classification model from training images evaluate model

More information

Probability: Why do we care? Lecture 2: Probability and Distributions. Classical Definition. What is Probability?

Probability: Why do we care? Lecture 2: Probability and Distributions. Classical Definition. What is Probability? Probability: Why do we care? Lecture 2: Probability and Distributions Sandy Eckel seckel@jhsph.edu 22 April 2008 Probability helps us by: Allowing us to translate scientific questions into mathematical

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information

Probability and Discrete Distributions

Probability and Discrete Distributions AMS 7L LAB #3 Fall, 2007 Objectives: Probability and Discrete Distributions 1. To explore relative frequency and the Law of Large Numbers 2. To practice the basic rules of probability 3. To work with the

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

3 PROBABILITY TOPICS

3 PROBABILITY TOPICS Chapter 3 Probability Topics 135 3 PROBABILITY TOPICS Figure 3.1 Meteor showers are rare, but the probability of them occurring can be calculated. (credit: Navicore/flickr) Introduction It is often necessary

More information