Section 12. Tests of independence and homogeneity

In this lecture we will consider a situation when our observations are classified by two different features and we would like to test if these features are independent. For example, we can ask if the number of children in a family and family income are independent. Our sample space $\mathcal{X}$ will consist of $a \times b$ pairs

$$\mathcal{X} = \{(i, j) : i = 1, \ldots, a, \; j = 1, \ldots, b\},$$

where the first coordinate represents the first feature, which belongs to one of $a$ categories, and the second coordinate represents the second feature, which belongs to one of $b$ categories. An i.i.d. sample $X_1, \ldots, X_n$ can be represented by the contingency table below, where $N_{ij}$ is the number of all observations in cell $(i, j)$.

Table 12.1: Contingency table.

                 Feature 2
    Feature 1    1          2          ...    b
    1            $N_{11}$   $N_{12}$   ...    $N_{1b}$
    2            $N_{21}$   $N_{22}$   ...    $N_{2b}$
    ...          ...        ...        ...    ...
    a            $N_{a1}$   $N_{a2}$   ...    $N_{ab}$

We would like to test the independence of the two features, which means that

$$P(X = (i, j)) = P(X^1 = i) \, P(X^2 = j).$$

If we introduce the notations $P(X = (i, j)) = \alpha_{ij}$, $P(X^1 = i) = p_i$ and $P(X^2 = j) = q_j$,

then we want to test that for all $i$ and $j$ we have $\alpha_{ij} = p_i q_j$. Therefore, our hypotheses can be formulated as follows:

$$H_0 : \alpha_{ij} = p_i q_j \text{ for all } (i, j) \text{ for some } (p_1, \ldots, p_a) \text{ and } (q_1, \ldots, q_b),$$
$$H_1 : \text{otherwise}.$$

We can see that this null hypothesis $H_0$ is a special case of the composite hypotheses from the previous lecture, and it can be tested using the chi-squared goodness-of-fit test. The total number of groups is $r = a \times b$. Since the $p_i$'s and $q_j$'s should add up to one,

$$p_1 + \cdots + p_a = 1 \quad \text{and} \quad q_1 + \cdots + q_b = 1,$$

one parameter in each sequence, for example $p_a$ and $q_b$, can be computed in terms of the other probabilities, and we can take $(p_1, \ldots, p_{a-1})$ and $(q_1, \ldots, q_{b-1})$ as free parameters of the model. This means that the dimension of the parameter set is

$$s = (a - 1) + (b - 1).$$

Therefore, if we find the maximum likelihood estimates $\hat p_i$ and $\hat q_j$ for the parameters of this model, then the chi-squared statistic

$$T = \sum_{i,j} \frac{(N_{ij} - n \hat p_i \hat q_j)^2}{n \hat p_i \hat q_j} \to \chi^2_{r-s-1} = \chi^2_{ab-(a-1)-(b-1)-1} = \chi^2_{(a-1)(b-1)}$$

converges in distribution to the $\chi^2_{(a-1)(b-1)}$ distribution with $(a-1)(b-1)$ degrees of freedom. To formulate the test it remains to find the maximum likelihood estimates of the parameters. We need to maximize the likelihood function

$$\prod_{i,j} (p_i q_j)^{N_{ij}} = \prod_i p_i^{\sum_j N_{ij}} \prod_j q_j^{\sum_i N_{ij}} = \prod_i p_i^{N_{i+}} \prod_j q_j^{N_{+j}},$$

where we introduced the notations

$$N_{i+} = \sum_j N_{ij} \quad \text{and} \quad N_{+j} = \sum_i N_{ij}$$

for the total number of observations in the $i$th row and the $j$th column. Since the $p_i$'s and $q_j$'s are not related to each other, maximizing the likelihood function above is equivalent to maximizing $\prod_i p_i^{N_{i+}}$ and $\prod_j q_j^{N_{+j}}$ separately. Let us maximize $\prod_{i=1}^{a} p_i^{N_{i+}}$ or, taking the logarithm, maximize

$$\sum_{i=1}^{a} N_{i+} \log p_i = \sum_{i=1}^{a-1} N_{i+} \log p_i + N_{a+} \log(1 - p_1 - \cdots - p_{a-1}),$$

since the probabilities add up to one. Setting the derivative with respect to $p_i$, $i \le a - 1$, equal to zero, we get

$$\frac{N_{i+}}{p_i} - \frac{N_{a+}}{1 - p_1 - \cdots - p_{a-1}} = \frac{N_{i+}}{p_i} - \frac{N_{a+}}{p_a} = 0$$

or $N_{i+} p_a = N_{a+} p_i$. Adding up these equations for all $i \le a$ and using $\sum_i N_{i+} = n$ and $\sum_i p_i = 1$ gives

$$n p_a = N_{a+} \implies p_a = \frac{N_{a+}}{n} \implies p_i = \frac{N_{i+}}{n}.$$

Therefore, the MLE for $p_i$ is

$$\hat p_i = \frac{N_{i+}}{n}.$$

Similarly, the MLE for $q_j$ is

$$\hat q_j = \frac{N_{+j}}{n}.$$

Therefore, the chi-squared statistic $T$ in this case can be written as

$$T = \sum_{i,j} \frac{(N_{ij} - N_{i+} N_{+j} / n)^2}{N_{i+} N_{+j} / n}$$

and the decision rule is given by

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

where the threshold $c$ is determined from the condition

$$\chi^2_{(a-1)(b-1)}(c, +\infty) = \alpha.$$

Example. In a 1992 poll, 189 Montana residents were asked whether their personal financial status was worse, the same, or better than one year ago. The opinions were divided into three groups by income range: under 20K, between 20K and 35K, and over 35K. We would like to test if opinions were independent of income.

Table 12.2: Montana outlook poll.

                    b = 3
    a = 3           Worse    Same    Better    Total
    <= 20K          20       15      12        47
    (20K, 35K)      24       27      32        83
    >= 35K          14       22      23        59
    Total           58       64      67        189

The chi-squared statistic is

$$T = \frac{(20 - 47 \times 58 / 189)^2}{47 \times 58 / 189} + \cdots + \frac{(23 - 67 \times 59 / 189)^2}{67 \times 59 / 189} = 5.21.$$
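As a quick check of the arithmetic, the statistic for the Montana table can be recomputed directly from the formula with row and column totals. The following is a minimal sketch in plain Python; the function name `chi_squared_independence` and the variable names are illustrative choices, not part of the notes.

```python
def chi_squared_independence(table):
    """Return (T, degrees_of_freedom, expected) for an a-by-b table of counts.

    Expected counts are N_{i+} * N_{+j} / n, i.e. n * p_i * q_j with the
    maximum likelihood estimates p_i = N_{i+}/n and q_j = N_{+j}/n.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]          # N_{i+}
    col_totals = [sum(col) for col in zip(*table)]    # N_{+j}
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    t = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return t, dof, expected


# Rows: income ranges; columns: Worse, Same, Better (Montana outlook poll).
montana = [
    [20, 15, 12],
    [24, 27, 32],
    [14, 22, 23],
]

T, dof, expected = chi_squared_independence(montana)
print(f"T = {T:.2f}, degrees of freedom = {dof}")
```

With these counts the sketch should reproduce the value $T = 5.21$ quoted above, with $(3-1)(3-1) = 4$ degrees of freedom.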

If we take the level of significance $\alpha = 0.05$, then the threshold $c$ is:

$$\chi^2_{(a-1)(b-1)}(c, +\infty) = \chi^2_4(c, \infty) = \alpha = 0.05 \implies c = 9.488.$$

Since $T = 5.21 < c = 9.488$, we accept the null hypothesis that opinions are independent of income.

Test of homogeneity. Suppose that the population is divided into $R$ groups and each group (or the entire population) is divided into $C$ categories. We would like to test whether the distribution of categories in each group is the same.

Table 12.3: Test of homogeneity.

                 Category 1    ...    Category C    Total
    Group 1      $N_{11}$      ...    $N_{1C}$      $N_{1+}$
    ...          ...           ...    ...           ...
    Group R      $N_{R1}$      ...    $N_{RC}$      $N_{R+}$
    Total        $N_{+1}$      ...    $N_{+C}$      $n$

If we denote

$$P(\text{Category}_j \mid \text{Group}_i) = p_{ij}$$

so that for each group $i \le R$ we have

$$\sum_{j=1}^{C} p_{ij} = 1,$$

then we want to test the following hypotheses:

$$H_0 : p_{ij} = p_j \text{ for all groups } i \le R,$$
$$H_1 : \text{otherwise}.$$

If the observations $X_1, \ldots, X_n$ are sampled independently from the entire population, then homogeneity over groups is the same as independence of groups and categories. Indeed, if we have homogeneity

$$P(\text{Category}_j \mid \text{Group}_i) = P(\text{Category}_j)$$

then we have

$$P(\text{Group}_i, \text{Category}_j) = P(\text{Category}_j \mid \text{Group}_i) \, P(\text{Group}_i) = P(\text{Category}_j) \, P(\text{Group}_i),$$

which means that the groups and categories are independent. Conversely, if we have independence then

$$P(\text{Category}_j \mid \text{Group}_i) = \frac{P(\text{Group}_i, \text{Category}_j)}{P(\text{Group}_i)} = \frac{P(\text{Category}_j) \, P(\text{Group}_i)}{P(\text{Group}_i)} = P(\text{Category}_j),$$

which is homogeneity. This means that to test homogeneity we can use the test of independence above.

Interestingly, the same test can be used in the case when the sampling is done not from the entire population but from each group separately, which means that we decide a priori about the sample size in each group, $N_{1+}, \ldots, N_{R+}$. When we sample from the entire population these numbers are random and by the LLN $N_{i+}/n$ will approximate the probability $P(\text{Group}_i)$, i.e. $N_{i+}$ reflects the proportion of group $i$ in the population. When we pick these numbers a priori, one can simply think that we artificially renormalize the proportion of each group in the population and test for homogeneity among groups as independence in this new artificial population.

Another way to argue that the test will be the same is as follows. Assume that

$$P(\text{Category}_j \mid \text{Group}_i) = p_j,$$

where the probabilities $p_j$ are all given. Then by Pearson's theorem we have the convergence in distribution

$$\sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} p_j)^2}{N_{i+} p_j} \to \chi^2_{C-1}$$

for each group $i \le R$, which implies that

$$\sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} p_j)^2}{N_{i+} p_j} \to \chi^2_{R(C-1)},$$

since the samples in different groups are independent. If now we assume that the probabilities $p_1, \ldots, p_C$ are unknown and plug in the maximum likelihood estimates $\hat p_j = N_{+j}/n$, then

$$\sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(N_{ij} - N_{i+} N_{+j} / n)^2}{N_{i+} N_{+j} / n} \to \chi^2_{R(C-1)-(C-1)} = \chi^2_{(R-1)(C-1)},$$

because we have $C - 1$ free parameters $p_1, \ldots, p_{C-1}$ and estimating each unknown parameter results in losing one degree of freedom.

Example (Textbook, page 560). In this example, 100 people were asked whether the service provided by the fire department in the city was satisfactory. Shortly after the survey, a large fire occurred in the city. Suppose that the same 100 people were asked again whether they thought that the service provided by the fire department was satisfactory. The results are in the following table:

                   Satisfactory    Unsatisfactory
    Before fire    80              20
    After fire     72              28

Suppose that we would like to test whether the opinions changed after the fire by using a chi-squared test. However, the i.i.d. sample consisted of pairs of opinions of 100 people

$$(X_1^1, X_1^2), \ldots, (X_{100}^1, X_{100}^2)$$

where the first coordinate/feature is a person's opinion before the fire, which belongs to one of two categories {"Satisfactory", "Unsatisfactory"}, and the second coordinate/feature is a person's opinion after the fire, which also belongs to one of the two categories {"Satisfactory", "Unsatisfactory"}. So the correct contingency table corresponding to the above data and satisfying the assumption of the chi-squared test would be the following:

                      Sat. after    Unsat. after
    Sat. before       70            10
    Unsat. before     2             18

Its row sums (80, 20) and column sums (72, 28) reproduce the "before" and "after" counts of the first table. In order to use the first contingency table, we would have to poll 100 people after the fire independently of the 100 people polled before the fire.
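To make the distinction concrete, the sketch below first checks that the margins of the paired table reproduce the counts of the first table, and then computes the homogeneity statistic that would apply only in the hypothetical situation described in the last sentence, where the two polls of 100 people are independent. The notes do not carry out this calculation; the helper name `chi_squared_statistic` is illustrative, and the threshold 3.841 is the standard tabulated 0.95 quantile of the chi-squared distribution with one degree of freedom, taken here as an assumption rather than from the notes.

```python
def chi_squared_statistic(table):
    """Return (T, degrees_of_freedom) using expected counts N_{i+} N_{+j} / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    t = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            t += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return t, dof


# Paired table: each of the 100 people contributes one (before, after) pair.
paired = [
    [70, 10],
    [2, 18],
]
paired_row_sums = [sum(row) for row in paired]
paired_col_sums = [sum(col) for col in zip(*paired)]
print("paired margins:", paired_row_sums, paired_col_sums)

# Hypothetical scenario only: two INDEPENDENT polls of 100 people each,
# so rows are groups (before / after) and the homogeneity test applies.
independent_polls = [
    [80, 20],   # before fire: satisfactory, unsatisfactory
    [72, 28],   # after fire:  satisfactory, unsatisfactory
]
T_hom, dof_hom = chi_squared_statistic(independent_polls)

c = 3.841  # assumed tabulated 0.95 quantile of chi-squared with 1 d.o.f.
print(f"T = {T_hom:.3f}, dof = {dof_hom}, exceeds threshold: {T_hom > c}")
```

Under that hypothetical independent design the statistic is about 1.75, below the assumed threshold, so homogeneity would not be rejected at level 0.05. With the actual paired data this calculation is not justified, which is the point of the example.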