Topics on Statistics 3

Pejman Mahboubi
April 24, 2018

1 Contingency Tables

Assume we ask a sample of 1127 Americans whether they believe in an afterlife. The table below cross-classifies the sample by gender and response:

        YES   NO
FEMALE  509  116
MALE    398  104

Here, Gender and Response are two categorical variables. Gender has 2 levels, Female and Male; Response also has 2 levels, Yes and No. In general, with two categorical variables X with I levels and Y with J levels, we can build an I x J contingency table which displays the I * J possible combinations of count outcomes. From the table, we can answer the following questions:

1. Joint probability of being male and not believing in an afterlife. If we let the coordinates denote the response and gender respectively, then we can write

   P(No, Male) = 104/1127.   (1)

2. Conditional probability of not believing in an afterlife given (or conditioned on) the respondent being male. Since the total number of males is 502 = 104 + 398, and 104 of them do not believe in an afterlife,

   P(No | Male) = 104/502.   (2)

Conditional probability can be defined in terms of joint probability.

Definition 1 (Conditional Probability). For two events A and B with P(B) > 0,

   P(A | B) = P(A, B) / P(B),   (3)

which can also be written as

   P(A, B) = P(B) P(A | B).   (4)

Example 1. Using the definition of conditional probability, we have

   P(No | M) = P(No, M) / P(Gender = M) = (104/1127) / (502/1127) = 104/502.
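These two computations can be checked with a short script; here is a minimal sketch in Python (the variable names are mine, not from the notes):

```python
# Joint and conditional probabilities from the afterlife table.
# Keys are (gender, response) pairs; values are the observed counts.
counts = {("F", "Yes"): 509, ("F", "No"): 116,
          ("M", "Yes"): 398, ("M", "No"): 104}
n = sum(counts.values())                              # 1127 respondents

p_no_and_male = counts[("M", "No")] / n               # joint P(No, Male)
n_male = counts[("M", "Yes")] + counts[("M", "No")]   # 502 males
p_no_given_male = counts[("M", "No")] / n_male        # conditional P(No | Male)

# Conditional = joint / marginal, as in Example 1.
assert abs(p_no_given_male - p_no_and_male / (n_male / n)) < 1e-12
print(round(p_no_and_male, 4), round(p_no_given_male, 4))   # 0.0923 0.2072
```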

2 Measures of Accuracy

Assume you train a classifier CL which predicts gender based on a subject's religiosity:

   CL(religious) = F,   CL(non-religious) = M.

Assume we apply this classifier to a test dataset of 20 individuals and get the following result:

> CL
   Gender Religiosity prdGender  class
1       F           Y         F  TRUE+
2       F           Y         F  TRUE+
3       F           Y         F  TRUE+
4       F           Y         F  TRUE+
5       F           N         M FALSE-
6       F           Y         F  TRUE+
7       F           Y         F  TRUE+
8       F           N         M FALSE-
9       F           N         M FALSE-
10      F           Y         F  TRUE+
11      F           Y         F  TRUE+
12      F           Y         F  TRUE+
13      M           N         M  TRUE-
14      M           Y         F FALSE+
15      M           N         M  TRUE-
16      M           Y         F FALSE+
17      M           N         M  TRUE-
18      M           Y         F FALSE+
19      M           Y         F FALSE+
20      M           Y         F FALSE+

Let's say F and M represent the positive and negative classes respectively. For example, we predicted that the samples on rows 5, 8, 9, 13, 15, 17 are negative and the rest are positive. Then, by comparing our predictions with the gender (first column), we can tell which of our predictions were TRUE or FALSE. This way, we put our predictions into four categories: TRUE+ or TP, TRUE- or TN, FALSE- or FN, and FALSE+ or FP. The following cross-classification gives the number of our predictions in each class:

> (tbl<-table(CL[,c("Gender","prdGender")]))
      prdGender
Gender F M
     F 9 3
     M 5 3

Therefore

   TP = 9   FN = 3   FP = 5   TN = 3

People in the positive class are the ones who are correctly predicted positive or wrongly predicted negative, and similarly for the negative class:

   P = TP + FN
   N = TN + FP

The diagonal elements are the counts of true positive and true negative predictions. One simple and highly intuitive measure of accuracy is

   accuracy = (TP + TN)/TOTAL = (9 + 3)/20 = 0.6.
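The four counts and the accuracy can be re-derived from the 20 (gender, prediction) pairs; a minimal Python sketch, with F taken as the positive class as in the text:

```python
# Tallying the four prediction categories from the 20-row test set.
# Each pair is (true gender, predicted gender); F is the positive class.
pairs = [("F", "F")]*9 + [("F", "M")]*3 + [("M", "M")]*3 + [("M", "F")]*5

TP = sum(t == "F" and p == "F" for t, p in pairs)   # true positives
FN = sum(t == "F" and p == "M" for t, p in pairs)   # false negatives
FP = sum(t == "M" and p == "F" for t, p in pairs)   # false positives
TN = sum(t == "M" and p == "M" for t, p in pairs)   # true negatives

accuracy = (TP + TN) / len(pairs)
print(TP, FN, FP, TN, accuracy)   # 9 3 5 3 0.6
```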

There is a major issue with this measure: sometimes it can fool us. Assume an endemic disease has infected 95% of a population. If we have a sample of 100 people and, with no testing at all, just predict all of them as infected, then we will have a high true positive count and 0 true negatives. The accuracy of this trivial model would be

   accuracy = (95 + 0)/100 = 0.95.

2.1 Sensitivity and Specificity

In the picture below, you see a square divided into two rectangles. The square contains all points in the dataset, while the left and right rectangles denote the + and - classes. We have a model that predicts the points inside the circle as + and the rest as negative. Then the left half-disk contains the true positives and the right half-disk the false positives. There are different ways of measuring the accuracy of classifiers.

1. Sensitivity, Recall or True positive rate is the probability that a positive sample (left rectangle) is predicted positive (in the disk):

   Sensitivity = |left half-disk| / |left rectangle|

2. Specificity is the probability that the diagnostic test shows negative, given the subject does not have the disease:

   Specificity = |right rectangle - right half-disk| / |right rectangle|

In our gender example,

   sensitivity = 9/(9 + 3) = 0.75.

Remark 1. We can also define the false positive rate as 1 - specificity.

Let's see an example in the context of testing for a disease. Assume a screening method for a rare disease has sensitivity .86 and specificity .88. This means that

   P(X = 1 | Y = 1) = .86,   1 denotes the positive class
   P(X = 2 | Y = 2) = .88,   2 denotes the negative class

where Y = 1 means the subject has the disease and X = 1 means the result of the test is positive. Furthermore, assume only 1% of the population is infected by this disease. A person takes the test and X = 1 (the test is positive). What is the probability that he has the disease, i.e. Y = 1?

Solution. We want to find P(Y = 1 | X = 1). By the Bayes rule,

   P(Y = 1 | X = 1) = P(X = 1 | Y = 1) P(Y = 1) / P(X = 1).

The numerator is readily computed as

   P(X = 1 | Y = 1) P(Y = 1) = .86 * .01 = .0086.

For the denominator we have

   P(X = 1) = P(X = 1 | Y = 1) P(Y = 1) + P(X = 1 | Y = 2) P(Y = 2) = .0086 + .12 * .99 = 0.1274.

Therefore

   P(Y = 1 | X = 1) = .0086 / 0.1274 = 0.0675,

which is still a very small number. Here, P(Y = 1) = .01 is called the Bayesian prior and 0.0675 is the posterior.
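The Bayes computation above, as a short Python sketch (sens, spec and prev are my names for the given quantities):

```python
sens = 0.86          # P(X=1 | Y=1), sensitivity of the screening test
spec = 0.88          # P(X=2 | Y=2), specificity
prev = 0.01          # P(Y=1), the prior (prevalence of the disease)

# Law of total probability for the denominator P(X=1):
p_pos = sens * prev + (1 - spec) * (1 - prev)    # .0086 + .1188 = .1274
posterior = sens * prev / p_pos                  # Bayes rule: P(Y=1 | X=1)
print(round(p_pos, 4), round(posterior, 4))      # 0.1274 0.0675
```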

Another way of measuring the accuracy of classifiers is by Precision and Recall:

   Precision = TP / (TP + FP)
   Recall    = TP / (TP + FN) = Sensitivity

In our example, Recall = .75 and

   Precision = 9/(9 + 5) = 0.6428571.   (5)

The function confusionMatrix() from the caret package takes the prediction column and the true values and computes the contingency table, precision, recall, sensitivity, specificity and much more. Note that it generates a confidence interval for the accuracy; this is because the test dataset is a random sample.

> library(caret)
> confusionMatrix(CL$prdGender,CL$Gender)
Confusion Matrix and Statistics

          Reference
Prediction F M
         F 9 5
         M 3 3

               Accuracy : 0.6
                 95% CI : (0.3605, 0.8088)
    No Information Rate : 0.6
    P-Value [Acc > NIR] : 0.5956

                  Kappa : 0.1304
 Mcnemar's Test P-Value : 0.7237

            Sensitivity : 0.7500
            Specificity : 0.3750
         Pos Pred Value : 0.6429
         Neg Pred Value : 0.5000
             Prevalence : 0.6000
         Detection Rate : 0.4500
   Detection Prevalence : 0.7000
      Balanced Accuracy : 0.5625

       'Positive' Class : F

There is a trade-off between Precision and Recall: if we tune a model to improve one of them, the other will typically decrease.

3 Marginal Probabilities and Independence

Remember the result of the survey:

> table(df)

        Response
Gender    NO YES
  FEMALE 116 509
  MALE   104 398

We can normalize the table by dividing each cell by the total number of participants, i.e. 1127, to define a joint probability P on the product space G x R as follows:

> prop.table(table(df))
        Response
Gender           NO        YES
  FEMALE 0.10292813 0.45164153
  MALE   0.09228039 0.35314996

This means that P assigns probabilities to pairs of gender and response. For example, P(M, N) = .0923.

We can normalize the table in different ways. For example, if we divide the first and second rows by their corresponding totals,

> (G.cond<-prop.table(table(df),1))
        Response
Gender          NO       YES
  FEMALE 0.1856000 0.8144000
  MALE   0.2071713 0.7928287

we get conditional probabilities conditioned on Gender. The first row is conditioned on Gender = F and the second on Gender = M, and we have

   P(N | F) = 0.1856,   P(Y | M) = 0.7928287.

Similarly, we can condition on the response (the probabilities in each column add up to 1):

> (R.cond<-prop.table(table(df),2))
        Response
Gender          NO       YES
  FEMALE 0.5272727 0.5611907
  MALE   0.4727273 0.4388093

   P(F | N) = 0.527 = 1 - P(M | N)

The third way of normalizing is marginalizing. For example, the marginal probability of Gender is

> prop.table(table(df$Gender))
   FEMALE      MALE 
0.5545697 0.4454303 

and of Response:

> prop.table(table(df$Response))
       NO       YES 
0.1952085 0.8047915 
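The three normalizations (joint, conditional on Gender, marginal) can be mirrored in a few lines of Python; the dictionary names are mine:

```python
# Joint, conditional and marginal probabilities from the survey counts,
# mirroring the prop.table() calls in the notes.
counts = {("FEMALE", "NO"): 116, ("FEMALE", "YES"): 509,
          ("MALE", "NO"): 104, ("MALE", "YES"): 398}
n = sum(counts.values())                                  # 1127

joint = {k: v / n for k, v in counts.items()}             # prop.table(table(df))
row_tot = {g: counts[(g, "NO")] + counts[(g, "YES")]
           for g in ("FEMALE", "MALE")}
cond_on_gender = {k: v / row_tot[k[0]]                    # prop.table(..., 1)
                  for k, v in counts.items()}
marg_gender = {g: tot / n for g, tot in row_tot.items()}  # prop.table(table(df$Gender))

print(round(joint[("MALE", "NO")], 4))              # P(M, N)  -> 0.0923
print(round(cond_on_gender[("FEMALE", "NO")], 4))   # P(N | F) -> 0.1856
print(round(marg_gender["FEMALE"], 7))              # P(F)     -> 0.5545697
```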

We can compute the marginal probabilities from the joint probabilities. For example,

   P(F) = P(F, Y) + P(F, N) = 0.45164153 + 0.10292813 = 0.5545697,

because the events R = Y and R = N partition the sample space: every subject falls in one of these two sets and no subject falls in both, i.e.

   P(Y ∪ N) = 1,   P(Y, N) = 0.   (6)

This is an example of the Law of Total Probability. To state this law formally, we need the definition of a partition.

Definition 2. A collection of events A_1, ..., A_n forms a partition of the sample space if it satisfies the following two conditions:

1. The events are mutually disjoint:

   P(A_i, A_j) = 0 for i ≠ j.   (7)

2. Together they cover the entire sample space:

   S = A_1 ∪ ... ∪ A_n.   (8)

Theorem 1 (Law of Total Probability). Let A_1, ..., A_n be a partition of a sample space. Then for any event B,

   P(B) = P(B, A_1) + ... + P(B, A_n)   (sum of joint probabilities)   (9)
   P(B) = P(A_1)P(B | A_1) + ... + P(A_n)P(B | A_n).   (10)

So the marginal probability of an event A is obtained by summing up all joint probabilities that have A as one of their inputs (margins). [Figure: A_1, A_2, A_3 partition the sample space S; an event B is split into the pieces B_1, B_2, B_3.] Equation (10) is also referred to as the Law of Total Conditional Probability; it is readily derived from (9) using the identity P(B, A_n) = P(A_n)P(B | A_n), see the definition of conditional probability and (4). You can check that a conditional probability can be derived by dividing a joint probability by a marginal probability. For example, check that

   P(N | F) = P(N, F) / P(F) = 0.10292813 / 0.5545697 = 0.1856.

3.1 Independence

3.1.1 Independence of Two Events

Two events A and B are independent with respect to a probability P : S → [0, 1] if

   P(A, B) = P(A) P(B),

which is equivalent to

   P(A | B) = P(A).

We interpret the last identity as: information about B doesn't change information about A.

3.1.2 Independent Random Variables

Given two categorical variables, X with I levels and Y with J levels, and joint probability density P(X = i, Y = j), let's define the notation

   π_ij = P(X = i, Y = j),   i = 1, ..., I and j = 1, ..., J.   (11)

We also define notation for the marginals:

   π_i+ = Σ_j π_ij = Σ_j P(X = i, Y = j) = P(X = i)   for i = 1, ..., I   (12)
   π_+j = Σ_i π_ij = Σ_i P(X = i, Y = j) = P(Y = j)   for j = 1, ..., J.   (13)

X and Y are independent if, for any i in {1, ..., I} and j in {1, ..., J}, the joint probability of the events equals the product of the marginals:

   P(X = i, Y = j) = P(X = i) P(Y = j)   (14)

or, using the definition of conditional probability,

   P(X = i | Y = j) = P(X = i)   for all i and j,   (15)

i.e., the conditional probability equals the marginal probability!

Example 2. There are 100 blue, black and red balls in a jar. Each ball is either wooden or glass. The cross-classification is given below. Is color independent of type?

      Blue Black Red
Glass    8    16  16
Wood    12    24  24

Solution. The joint probabilities are

      Blue Black  Red
Glass 0.08  0.16 0.16
Wood  0.12  0.24 0.24

The marginals are

> (m1<-apply(joint,1,sum))
Glass  Wood 
  0.4   0.6 
> (m2<-apply(joint,2,sum))
 Blue Black   Red 
  0.2   0.4   0.4 

Every joint probability equals the product of the corresponding marginals (e.g. 0.08 = 0.4 * 0.2), so color and type are independent.
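Example 2's check can be automated: verify that every joint probability equals the product of its marginals. A Python sketch (names are mine):

```python
# Independence check for the jar of 100 balls (Example 2).
counts = {("Glass", "Blue"): 8, ("Glass", "Black"): 16, ("Glass", "Red"): 16,
          ("Wood", "Blue"): 12, ("Wood", "Black"): 24, ("Wood", "Red"): 24}
n = sum(counts.values())   # 100 balls

types, colors = ("Glass", "Wood"), ("Blue", "Black", "Red")
p_type = {t: sum(c for (tt, _), c in counts.items() if tt == t) / n for t in types}
p_color = {k: sum(c for (_, kk), c in counts.items() if kk == k) / n for k in colors}

# Independence: every joint probability equals the product of its marginals.
independent = all(abs(counts[(t, k)] / n - p_type[t] * p_color[k]) < 1e-12
                  for t in types for k in colors)
print(independent)   # True
```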

        YES   NO
FEMALE  509  116
MALE    398  104

Table 1: Cross-Classifying Contingency Table

         NO  YES
FEMALE 0.19 0.81
MALE   0.21 0.79

Table 2: Conditional Probabilities

4 Comparing Probabilities in 2 x 2 Contingency Tables

In our 2 x 2 contingency table, think of the levels of gender (Female, Male) as the explanatory random variable, or groups, that predict the response variable (Yes, No). Then we can think of p_1 = 0.81 and p_2 = 0.79 as the probabilities of success (Yes) in each group.

Remark 2. Here we tacitly take the numbers of males and females as fixed (non-random). Our analysis doesn't say what our best guess for the number of males and females would be if we repeated the sampling; it only analyzes the range of the probabilities p_1 and p_2 in each group.

Let π_1 and π_2 denote the true rates of Response = YES in the female and male populations respectively. Then Remark 2 implies that the YES and NO responses in each group follow Bernoulli distributions with parameters π_1 and π_2, which are unknown to us.

Remark 3. If B is a Bernoulli random variable with parameter p, then mean(B) = p and Var(B) = p(1 - p).

We know the sample rates are p_1 = 0.81 and p_2 = 0.79. Can we compute a 95% confidence interval for π_1 - π_2? The rate of success p_2 = (1 + 1 + 0 + ... + 0 + 1)/502, where we put 1 and 0 for YES and NO responses respectively, over the 502 male participants; similarly for p_1 over the 625 female participants. Therefore, we can think of p_1 and p_2 as random sample means. By the Central Limit Theorem, we can assume they are sampled from two normally distributed random variables. More precisely,

   p_1 ~ N(π_1, σ_1²),   p_2 ~ N(π_2, σ_2²),   (16)

where σ_1² = π_1(1 - π_1)/625 and σ_2² = π_2(1 - π_2)/502. Since we don't have π_1 and π_2, we use p_1 and p_2 as approximations for them and write s_1² and s_2² instead of σ_1² and σ_2². Therefore we have

> p1=0.81;p2<-0.79
> (s1<-p1*(1-p1)/625)
[1] 0.00024624
> (s2<-p2*(1-p2)/502)
[1] 0.0003304781

so, approximately,

   p_1 ~ N(0.81, 0.00024624),   p_2 ~ N(0.79, 0.0003304781).   (17)
Now, let's discuss p_1 - p_2. But first a theorem! Note that in the following theorem the variance is always the sum of the variances.

Theorem 2. If X_1 and X_2 are independent normal random variables with mean and variance parameters (m_1, σ_1²) and (m_2, σ_2²), then X_1 ± X_2 is normal with parameters (m_1 ± m_2, σ_1² + σ_2²).

Therefore, p_1 - p_2 is normally distributed with mean = .81 - .79 = .02 and

   SE = sqrt(0.00024624 + 0.0003304781) = 0.02401496.   (18)

Therefore, the 95% confidence interval is

   [.02 - 1.96 * 0.02401496, .02 + 1.96 * 0.02401496] = [-0.027, 0.067].

We can also perform hypothesis testing. Assume we want to check

   H_0 : π_1 = π_2   vs   H_1 : π_1 ≠ π_2.
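The confidence interval above, together with the pooled test discussed next in the notes, can be sketched in Python; sp uses the weighted-average formula from the notes, so small rounding differences from the printed values are expected:

```python
from math import sqrt, erf

p1, n1 = 0.81, 625          # female Yes rate (rounded, as in the notes)
p2, n2 = 0.79, 502          # male Yes rate

s1 = p1*(1 - p1)/n1         # estimated variance of p1
s2 = p2*(1 - p2)/n2         # estimated variance of p2
se = sqrt(s1 + s2)
ci = (p1 - p2 - 1.96*se, p1 - p2 + 1.96*se)

# Pooled standard error under H0 (the weighted average used in the notes):
sp = sqrt(((n1 - 1)*s1 + (n2 - 1)*s2) / (n1 + n2 - 2))
z = (p1 - p2) / sp
p_val = 2 * (1 - 0.5*(1 + erf(z / sqrt(2))))     # two-sided normal p-value
print(round(se, 5), [round(x, 3) for x in ci], round(p_val, 3))
```

The p-value reproduces the notes' 2*pnorm(...) result of roughly 0.235, so H_0 is not rejected.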

Remark 4. π_1 = π_2 means P(R = Y | F) = P(R = Y | M). Call this common value c. By the Law of Total Probability,

   P(Y) = P(Y | F)P(F) + P(Y | M)P(M) = c (P(F) + P(M)) = c,

so P(Y | F) = P(Y) = P(Y | M), and consequently P(N | F) = P(N) = P(N | M). In other words, Response is independent of Gender.

Remember how we approximated σ_1² and σ_2² in (16) by s_1² and s_2². In hypothesis testing, under the null hypothesis π_1 = π_2, we can do a better job, thanks to Remark 4. The pooled variance is a common variance σ², closely related to the between variation in AOV, replacing both σ_1² and σ_2². If π_1 = π_2 then, as mentioned in Remark 4, Response is independent of Gender; therefore the two samples are from the same population, so σ_1² = σ_2² = σ². We estimate σ² by averaging the sample variances:

   s_p = sqrt( ((n_1 - 1)s_1² + (n_2 - 1)s_2²) / (n_1 + n_2 - 2) ) = 0.01685372,

compare with (18). If π_1 = π_2, then p_1 - p_2 ~ N(0, 0.01685372²). The number we sampled is p_1 - p_2 = .02. What is the chance that a draw from N(0, 0.01685372²) lands at .02 or farther from the center 0?

> 2*pnorm(q = -.02,mean = 0,sd = 0.01685372,lower.tail = T)
[1] 0.2353532

Therefore, we cannot reject the null hypothesis.

5 Odds and Odds Ratio

If π is the rate of success in a binomial trial, then its corresponding odds is defined to be

   odds = π / (1 - π).   (19)

If odds = 4, then a success is 4 times as likely as a failure: we expect to see, on average, 4 successes for each failure. We can of course retrieve π if we know its corresponding odds, by π = odds/(odds + 1).

Every 2 x 2 contingency table induces two rates of success, π_1 and π_2, corresponding to its rows. Let odds_1 and odds_2 be the odds corresponding to π_1 and π_2. Dividing odds_1 by odds_2 gives another measure of association between the rows.
This measure, denoted by θ, is called the odds ratio:

   θ = odds_1 / odds_2 = (π_1/(1 - π_1)) / (π_2/(1 - π_2)).   (20)

Odds ratios are positive numbers in the interval θ ∈ (0, ∞). θ = 4 means the odds of the group in the first row are 4 times the odds of the group in the second row. θ = 1/4 means the opposite: the odds of the group in the second row are 4 times the odds of the group in the first row. θ = 1 means the odds are equal, which implies that π_1 = π_2. In general, θ > 1 implies π_1 > π_2, θ = 1 implies π_1 = π_2, and θ < 1 implies π_1 < π_2. Furthermore, for any positive number α > 0, θ = α and θ = 1/α convey opposite implications about the odds of the two groups.

As always in statistics, we only have the sample odds ratio, defined by

   θ̂ = (p_1/(1 - p_1)) / (p_2/(1 - p_2)).

Consider two populations with equal odds. Then the sampling odds ratio will be around 1, but its left tail lies in (0, 1) while its right tail lies in (1, ∞); the sampling distribution of the odds ratio is therefore highly skewed. If we consider log θ instead of θ, we get nicer, more intuitive properties. For example:

- log θ = 0 (i.e. θ = 1) implies π_1 = π_2.
- log θ = 2 and log θ = -2 are symmetric around 0 and convey opposite statements about π_1 and π_2.

The sample log odds ratio, log θ̂, has a less skewed, bell-shaped sampling distribution with standard deviation estimated by

   SE = sqrt(1/n_11 + 1/n_12 + 1/n_21 + 1/n_22),   (21)

where the n_ij are the counts in the contingency table.

Example 3. In our contingency table of the afterlife belief, compute log θ̂ and a 95% confidence interval for log θ.

Solution. The sample log θ̂ and standard deviation are

> odds.f<- 0.8144000/(1-0.8144000)
> odds.m<- 0.7928287/(1-0.7928287)
> (p<-log(odds.f/odds.m))
[1] 0.1367966
> (SE=sqrt((1/509)+(1/116)+(1/398)+(1/104)))
[1] 0.1507092

Then the lower and upper limits of the 95% CI are

> (lower<-p-1.96*SE)
[1] -0.1585935
> (upper<-p+1.96*SE)
[1] 0.4321867

Since zero is included in the interval, log θ = 0 (i.e. π_1 = π_2) cannot be ruled out at the 95% confidence level. By exponentiating, we find that the 95% CI for θ is [0.8533432, 1.540623].

5.1 Contingency Tables and Chi-Square Test

A 2000 General Social Survey cross-classifies 2757 subjects by gender and political party as below.

       Democrat Independent Republican
Female      762         327        468
Male        484         239        477

This table defines a sample joint probability p = {p_11, p_12, p_13, p_21, p_22, p_23}:

       Democrat Independent Republican
Female     0.28        0.12       0.17
Male       0.18        0.08       0.17

Of course p is random, as it is defined by a random sample. Is there enough evidence to reject H_0, defined by

   H_0 : π = {0.25, 0.1, 0.25, 0.15, 0.1, 0.15}?

Solution. The expected count for each cell, µ = (µ_ij), based on π = (π_ij), is obtained by µ = π * 2757:

       Democrat Independent Republican
Female   689.25      275.70     689.25
Male     413.55      275.70     413.55

There are 6 residuals, the differences between the expected (fitted) values and the sample (actual) values. The squared residuals are

> sample<-c(762,327,468,484,239,477)
> expected<-c(689.25,275.70,689.25,413.55,275.70,413.55)
> res.sq<-c(sample-expected)^2

       Democrat Independent Republican
Female  5292.56     2631.69   48951.56
Male    4963.20     1346.89    4025.90

Bigger residuals are stronger evidence against H_0.

5.2 Chi-squared Distribution

The chi-squared distribution with k degrees of freedom, denoted χ²_k, is the distribution of the sum of squares of k independent standard normal random variables. Think of the residuals in a contingency table: they are approximately normal, and after dividing by their standard deviations they become approximately standard normal.

Definition 3 (Wikipedia). If Z_1, ..., Z_k are independent, standard normal random variables, then the sum of their squares,

   Q = Σ_{i=1}^{k} Z_i²,   (22)

is distributed according to the chi-squared distribution with k degrees of freedom. This is usually denoted as Q ~ χ²_k. Furthermore, if X ~ χ²_k, then EX = k and Var X = 2k.
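Definition 3 can be checked empirically: simulate many sums of k squared standard normals and compare the sample mean and variance with k and 2k. A Python sketch (the sample size and seed are my choices):

```python
# Empirical check of Definition 3 with k = 3: the sum of squares of three
# independent standard normals should have mean close to k and variance
# close to 2k.
import random
random.seed(0)

k, n = 3, 100_000
draws = [sum(random.gauss(0, 1)**2 for _ in range(k)) for _ in range(n)]

mean = sum(draws) / n
var = sum((x - mean)**2 for x in draws) / n
print(round(mean, 2), round(var, 2))   # close to 3 and 6
```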

The graph of the densities shows how χ²_k becomes closer to a normal density as the degrees of freedom increase. In the discussion that comes next, we will talk about the mean and variance of the frequencies of each cell, not to be confused with the mean and variance of Q.

5.3 Simulating the Contingency Table

Assume we know the population probabilities of each cell: π = {π_11, π_12, π_13, π_21, π_22, π_23}. Then the counts of these 6 cells follow a multinomial distribution. For example, with 1000 people we might get

> set.seed(1001)
> pi<-c(.1,.3,.2,.1,.2,.1)
> r=14
> N=1000
> (sample<-rmultinom(n = r,size = N,prob = pi))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]  106  114  118  103  128   86  109  103   93   106   128    94   106   103
[2,]  291  282  282  282  290  320  281  307  306   296   288   299   305   309
[3,]  199  199  185  194  195  198  221  195  203   207   216   192   197   190
[4,]   97   91  104  116   89   93   92   97   93   105    85    99    83   109
[5,]  195  219  187  190  192  187  211  208  190   183   195   220   193   190
[6,]  112   95  124  115  106  116   86   90  115   103    88    96   116    99

Each column is one random sample of the 6 cells of the contingency table, and each row contains 14 samples for one of the 6 cells. Check that each cell looks normal with mean = Nπ and sd = sqrt(Nπ(1 - π)):

> (mean<-1000*pi[1])
[1] 100
> (sd=sqrt(1000*pi[1]*(1-pi[1])))
[1] 9.486833
> hist(sample[1,])

[Histogram of sample[1,]: roughly bell-shaped, centered near 100.]
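A stdlib Python analogue of the rmultinom() simulation above (the seed and names are my choices):

```python
# Drop N = 1000 subjects into 6 cells with probabilities pi, then compare
# each count with its binomial mean N*pi and sd sqrt(N*pi*(1-pi)).
import random
from math import sqrt

random.seed(1)
pi = [.1, .3, .2, .1, .2, .1]
N = 1000

cells = random.choices(range(6), weights=pi, k=N)   # one multinomial sample
counts = [cells.count(i) for i in range(6)]

for p, c in zip(pi, counts):
    m, s = N*p, sqrt(N*p*(1 - p))
    assert abs(c - m) < 5*s       # each count is within 5 sd of its mean
print(counts, sum(counts))
```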

So each cell cell_ij is binomial with parameter π_ij, with mean Nπ_ij and variance Nπ_ij(1 - π_ij); see Theorem 3. Therefore, by the CLT,

   (O - Nπ_ij) / sqrt(Nπ_ij(1 - π_ij)) ~ N(0, 1)   approximately.

5.4 A Two-Cell Model

Let's see what this yields when there are only two cells: one row and 2 columns, i.e. j = 1 and i = 1, 2. Let π_1 and π_2 denote the probabilities of cell 1 and cell 2, with π_1 + π_2 = 1. Furthermore, let

1. E_i = Nπ_i denote the mean (expected value) of cell i,
2. O_i the observation in cell i, so O_1 + O_2 = N,
3. Q_i = (O_i - Nπ_i) / sqrt(Nπ_i(1 - π_i)).

By the CLT, Q_1 is sampled from an approximately standard normal distribution (see Theorem 3), and, writing π = π_1,

   Q_1² = (O_1 - E_1)² / (Nπ(1 - π))
        = (O_1 - Nπ)² / (Nπ) + (O_2 - N(1 - π))² / (N(1 - π))   (after some algebraic manipulation)
        = (O_1 - E_1)²/E_1 + (O_2 - E_2)²/E_2.

Therefore, in a two-cell model, if we compute (O_i - E_i)²/E_i for each cell and add them up, the result is a χ²_1 random variable.

Theorem 3. If X is a binomial random variable with parameters N and π, where N is the number of trials and π the rate of success, then

   EX = Nπ,   sd X = sqrt(Nπ(1 - π)).   (23)

5.5 Six-Cell Model

The same holds when there are 6 cells. We want to test the null hypothesis

   H_0 : π = (π_1, ..., π_6)   vs   H_1 : π is not as given by H_0,   (24)

and we have observations O_1, ..., O_6, from which the total number of observations N = O_1 + ... + O_6 can be computed.

1. Compute E_i = Nπ_i for i = 1, ..., 6.
2. Compute (O_i - E_i)²/E_i for i = 1, ..., 6.
3. Compute the test statistic Q = Σ_{i=1}^{6} (O_i - E_i)²/E_i.

Q is a number which, under H_0, is sampled from χ²_5: Q ~ χ²_5. Check the table to find the chance p_val of sampling Q or bigger from χ²_5. If p_val < .05, reject H_0.

5.6 Example Continued: Goodness of Fit

For our cross-classification of gender and political party, we already computed all the summands. Adding them up,

> (chi.sq<-sum(res.sq/expected))
[1] 114.8675

Under H_0, this number is sampled from χ²_5. What is the chance that χ²_5 generates 114.87 or bigger?

> (p_val<-pchisq(q = chi.sq,df = 5,lower.tail = FALSE))
[1] 3.830439e-23

So we reject H_0. This is an example of checking the goodness of fit using a chi-squared test. We are given observations, and our model is defined by π_1, ..., π_6, or equivalently π_ij, i = 1, ..., 3, j = 1, 2. In this case we concluded that the model defined by H_0 is not appropriate.

Example 4. Assume we are given 20 numbers and we want to see whether it is acceptable to assume they are sampled from a normal distribution. The numbers are

 [1]  88  89  90  90  93  94  99 100 101 102 103 104 107 107 107 108 114 117 118
[20] 120

> hist(t)

[Histogram of t: values spread between 85 and 120.]

Solution. If the numbers are from a normal distribution, then the mean and standard deviation would be

> (m<-mean(t))
[1] 102.55
> (sd<-sd(t))
[1] 9.923258

The range starts at 88 and ends with the biggest number, 120. Let's make 3 cells. Cell 2 contains all observations within one standard deviation of the mean, i.e. all observations in the interval Cell 2 = [92.62674, 112.4733]; Cell 1 = (-∞, 92.62674] and Cell 3 = (112.4733, ∞). If the numbers are from N(102, 10²), then we can find the probability of each cell: from the 68% rule for the normal distribution,

> pi<-c(.16,.68,.16)
> o1<-sum(t<(m-sd));o3<-sum(t>(m+sd));o2<-(length(t)-(o1+o3))
> (ob<-c(o1,o2,o3))

[1]  4 12  4
> (E<-pi*20)
[1]  3.2 13.6  3.2
> (res<-((ob-E)^2)/E)
[1] 0.2000000 0.1882353 0.2000000
> (chi.sq<-sum(res))
[1] 0.5882353

There are 3 cells and therefore 2 degrees of freedom. Is 0.588 too big for χ²_2?

> (p_val<-pchisq(q = chi.sq,df = 2,lower.tail = FALSE))
[1] 0.7451888

No; we cannot reject the possibility that the numbers are sampled from N(102, 10²).

5.7 Test of Independence

We dealt with independence in Example 2. What is different here? In Example 2 we had access to the entire population (the jar of balls) and could compute π_ij for each cell. Here we have a sample, and we need a more powerful theory to infer about the population probabilities. We cannot apply the definition of independence directly to the probabilities p derived from the contingency table, as they are at best estimates of π_ij and fluctuate.

5.7.1 Structure of H_0

In the χ² test of independence, H_0 is different from the H_0 for goodness of fit in (24). Here, instead of being given the joint probabilities π_ij, we are given only the observations. We can then add up the observations in columns and rows to compute the marginals π_i+ and π_+j, see (12).

5.7.2 Degrees of Freedom

When testing goodness of fit, the only constraint on the joint probabilities π_ij is

   Σ_ij π_ij = 1,   (25)

therefore there are I*J - 1 degrees of freedom. When testing independence, the marginals are computed from the observations, and the joint probabilities are computed from the marginals under the independence condition, see (14). Therefore, instead of the single constraint Σ_ij π_ij = 1, the joint probabilities in each row and each column must add up to the corresponding first and second marginals. The degrees of freedom become (I - 1)(J - 1).

5.7.3 How It Works

Assume there are I rows and J columns, so i ranges over 1, ..., I and j over 1, ..., J. We are given observations O_ij for all i, j. Therefore we can compute the sample marginals p_i+ and p_+j. Use the sample marginals as approximations for π_i+ and π_+j.
Use π_i+ and π_+j to compute the joint probabilities π_ij under the independence condition. Finally, test the hypothesis that the observations are consistent with these joint probabilities:

   H_0 : O_ij is sampled from π_ij for all i, j
   H_1 : O_ij is not sampled from π_ij for at least one i, j
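As a preview of the worked solution below, the whole procedure applied to the 2 x 3 jar sample used next can be sketched in Python; since df = (2-1)(3-1) = 2, the chi-squared right-tail probability has the closed form exp(-Q/2):

```python
from math import exp

counts = [[2, 6, 8],      # Glass: Blue, Black, Red
          [7, 14, 13]]    # Wood
n = sum(map(sum, counts))                        # 50 balls in the sample
row_tot = [sum(row) for row in counts]           # [16, 34]
col_tot = [sum(col) for col in zip(*counts)]     # [9, 20, 21]

q = 0.0
for i in range(2):
    for j in range(3):
        e = row_tot[i] * col_tot[j] / n          # expected count under independence
        q += (counts[i][j] - e)**2 / e

p_val = exp(-q / 2)   # chi-squared right tail; closed form valid only for df = 2
print(round(q, 4), round(p_val, 4))              # 0.7907 0.6734
```

This reproduces the Q and p-value computed step by step (and by chisq.test()) below.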

Solution. We discuss the procedure step by step. Remember that we are only given the observations O_ij.

1. Let O = Σ_ij O_ij denote the total number of observations.
2. Add the observations in each row and divide by O to obtain the row marginals π_1+, ..., π_I+.
3. Add the observations in each column and divide by O to obtain the column marginals π_+1, ..., π_+J.
4. Under independence, compute the joint probabilities π_ij = π_i+ π_+j.
5. Compute all the expected observations E_ij = π_ij O.
6. Compute (O_ij − E_ij)² / E_ij for all i, j.
7. Compute the test statistic:

    Q = Σ_ij (O_ij − E_ij)² / E_ij        (26)

8. Compute the p-value, the right tail of the χ² distribution beyond Q, using (I − 1)(J − 1) degrees of freedom.

Let me sample from the jar in Example 2. The sampling is with replacement, so the size of the sample could be bigger than 100. Test the hypothesis

    H0: The color of a ball is independent of its type.

          Blue  Black  Red
    Glass    2      6    8
    Wood     7     14   13

To be able to use R, we store these numbers in a matrix:

> (a<-matrix(data = c(2,6,8,7,14,13),nrow = 2,byrow = TRUE,
+ dimnames = list(c("Glass","Wood"),c("Blue","Black","Red"))))
      Blue Black Red
Glass    2     6   8
Wood     7    14  13

1. The total number of observations is O = 2 + 6 + 8 + 7 + 14 + 13 = 50.

2. The first marginal is m.1 = [(2 + 6 + 8)/50, (7 + 14 + 13)/50]:

> (m.1<-apply(X = a,MARGIN = 1,FUN = sum)/50)
Glass  Wood
 0.32  0.68

3. The second marginal:

> (m.2<-apply(X = a,MARGIN = 2,FUN = sum)/50)
 Blue Black   Red
 0.18  0.40  0.42

4. The joint probabilities:

> (jp<-outer(m.1,m.2))
        Blue Black    Red
Glass 0.0576 0.128 0.1344
Wood  0.1224 0.272 0.2856

5. The expected observations:

> (E<-jp*50)
      Blue Black   Red
Glass 2.88   6.4  6.72
Wood  6.12  13.6 14.28

6. Compute the squared residuals divided by the expected values:

> (R<-((E-a)^2)/E)
           Blue      Black       Red
Glass 0.2688889 0.02500000 0.2438095
Wood  0.1265359 0.01176471 0.1147339

7. Compute Q:

> (chi.sq<-sum(R))
[1] 0.790733

8. The degrees of freedom are (2 − 1)(3 − 1) = 2.

9. Compute the p-value:

> (p_val<-pchisq(q = .790733,df = 2,lower.tail = FALSE))
[1] 0.6734332

We cannot reject independence. Alternatively, you can simply feed your data to the following command in R:

> chisq.test(a)

        Pearson's Chi-squared test

data:  a
X-squared = 0.79073, df = 2, p-value = 0.6734

5.8 ROC curve and AUC

Assume an epidemic affects 10% of a population. We design a classifier that generates a probability p of being in the positive class (diseased). We gather two groups of 500 healthy and 50 diseased patients and look at the distribution of the scores that the classifier produces for each group. Assume we get the following results:

> H.score<-rnorm(500,.3,.15)
> S.score<-rnorm(50,.7,.1)

Let's look at the overlapping distributions of the two groups of scores in a plot. The vertical line is a threshold which defines the prediction rule: scores on the right (bigger than the threshold) are predicted sick, and scores on the left are predicted healthy. Then

1. Red on the left of the vertical line means: TRUE NEGATIVE

2. Red on the right of the vertical line means: FALSE POSITIVE
3. Green on the right of the vertical line means: TRUE POSITIVE
4. Green on the left of the vertical line means: FALSE NEGATIVE

We can place the vertical line at any x ∈ (0, 1) and compute the corresponding true positive rate and false positive rate. The plot of these points in (false positive rate, true positive rate) space is a curve. Assume that from the data we know the positive class and the negative class (this is not based on the prediction, but on the labels of the data). For example, H.score has 500 members, therefore we know that TN + FP = 500. Similarly, there are 50 sick people, which means TP + FN = 50. Before we proceed further to compute the rates, let's redefine S.score and H.score.

> library(ggplot2)
> dat<-data.frame(dens = c(H.score,S.score),lines=rep(c("Healthy","Sick"),c(500,50)))
> ggplot(dat, aes(dens, fill = lines)) + geom_histogram(position = "dodge")+
+ geom_vline(xintercept = .6)

[Figure: overlapping histograms of the Healthy and Sick scores, with a vertical threshold line at 0.6.]
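The four counts named above can also be tabulated directly with `table()`. This is a sketch with our own variable names (`truth`, `pred`) and our own seed, so the exact counts will differ from the plotted sample:

```r
# Confusion counts at threshold 0.6 (truth, pred, and the seed are ours).
set.seed(1)
H.score <- rnorm(500, .3, .15)   # healthy scores (negative class)
S.score <- rnorm(50,  .7, .1)    # sick scores (positive class)
truth <- rep(c("Healthy", "Sick"), c(500, 50))
pred  <- factor(ifelse(c(H.score, S.score) > .6, "Sick", "Healthy"),
                levels = c("Healthy", "Sick"))
table(truth, pred)
# Row "Healthy": TN in column "Healthy", FP in column "Sick".
# Row "Sick":    FN in column "Healthy", TP in column "Sick".
```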

> H.score<-rnorm(500,.4,.2)
> S.score<-rnorm(50,.6,.2)

Therefore, for a given threshold, say .6, we have

> threshold=.6
> lp<-length(pclass<-S.score)
> ln<-length(nclass<-H.score)
> c(tpr=sum(pclass>threshold)/lp,fpr=sum(nclass>threshold)/ln)
 tpr  fpr
0.46 0.18

where tpr and fpr stand for true positive rate and false positive rate, respectively. We can compute tpr and fpr for different thresholds:

> ts<-seq(from = .01,to = .99,by = .01)
> K<-unname(unlist(lapply(X = ts,FUN = function(threshold)
+ c(tpr=sum(pclass>threshold)/length(pclass),
+ fpr=sum(nclass>threshold)/length(nclass)))))

Then we can separate tpr from fpr:

> tpr<-K[seq(from=1,to = 197,by = 2)]
> fpr<-K[seq(from=2,to = 198,by = 2)]

Then we can plot it:

> plot(fpr,tpr,type = 'l')

[Figure: ROC curve; fpr on the horizontal axis, tpr on the vertical axis.]

The curve produced this way is called the ROC curve, and the area under the curve equals the probability that
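Interleaving the rates in K and then splitting them with seq() works, but sapply() collects the same numbers into a 2 x 99 matrix directly. A sketch, with our own seed, so the rates will differ slightly from the transcript:

```r
# Alternative to the lapply/seq bookkeeping: sapply returns a matrix
# whose rows are named "tpr" and "fpr". The seed is our addition.
set.seed(2)
nclass <- rnorm(500, .4, .2)   # healthy (negative-class) scores
pclass <- rnorm(50,  .6, .2)   # sick (positive-class) scores
ts <- seq(from = .01, to = .99, by = .01)
rates <- sapply(ts, function(threshold)
  c(tpr = sum(pclass > threshold)/length(pclass),
    fpr = sum(nclass > threshold)/length(nclass)))
# plot(rates["fpr", ], rates["tpr", ], type = "l")  # same ROC curve
```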

the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example [from https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it]. This fact allows us to estimate the area under the curve by sampling and counting:

> p = replicate(50000, sample(pclass, size=1) > sample(nclass, size=1))
> (mean(p))
[1] 0.70232

Let's repeat the process with a modified classifier which gives different (better) scores:

> H.score<-rnorm(500,.4,.15)->nclass
> S.score<-rnorm(50,.6,.12)->pclass
> ts<-seq(from = .01,to = .99,by = .01)
> K<-unname(unlist(lapply(X = ts,FUN = function(threshold)
+ c(tpr=sum(pclass>threshold)/length(pclass),
+ fpr=sum(nclass>threshold)/length(nclass)))))

Then we can separate tpr from fpr:

> tpr<-K[seq(from=1,to = 197,by = 2)]
> fpr<-K[seq(from=2,to = 198,by = 2)]

Then we can plot it:

> plot(fpr,tpr,type = 'l')

[Figure: ROC curve of the modified classifier; fpr on the horizontal axis, tpr on the vertical axis.]
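Instead of sampling pairs, the same probability can be computed exactly by comparing every positive score with every negative score via outer(). A sketch with a fresh simulated sample (our own seed), so the value will differ slightly from the transcript's Monte Carlo estimates:

```r
# Exact AUC: the fraction of (positive, negative) score pairs that are
# ranked correctly. The seed is our addition; scores mimic the modified
# classifier above.
set.seed(3)
nclass <- rnorm(500, .4, .15)   # healthy scores
pclass <- rnorm(50,  .6, .12)   # sick scores
auc <- mean(outer(pclass, nclass, ">"))  # P(positive score > negative score)
auc
```

With a few hundred scores the 50 x 500 logical matrix is cheap; for very large samples a rank-based formula would be preferable.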

> p = replicate(50000, sample(pclass, size=1) > sample(nclass, size=1))
> (mean(p))
[1] 0.85514

You can play with the distributions of pclass and nclass, repeating the procedure above, to see that as the scores become more concentrated and better separated, the area under the curve increases.
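That last remark can be checked with a quick experiment: keep the spread fixed and widen the gap between the two class means. A sketch, where the helper name `sim.auc` and the seed are ours:

```r
# AUC grows as the two score distributions move apart.
# sim.auc and the seed are our own additions, not the notes' code.
set.seed(4)
sim.auc <- function(gap, sd = .15, n = 500) {
  nclass <- rnorm(n, .4,       sd)        # negative-class scores
  pclass <- rnorm(n, .4 + gap, sd)        # positive-class scores
  mean(outer(pclass, nclass, ">"))        # exact AUC for this sample
}
sapply(c(0, .1, .2, .4), sim.auc)   # increases from about 0.5 toward 1
```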