Goodness-of-fit for composite hypotheses.

Section 11. Goodness-of-fit for composite hypotheses.

Example. Let us consider a Matlab example. Let us generate 50 observations from N(1, 2) (in Matlab's parametrization, mean 1 and standard deviation 2):

    X = normrnd(1,2,50,1);

Then running the chi-squared goodness-of-fit test chi2gof,

    [H,P,STATS] = chi2gof(X)

outputs

    H = 0, P = 0.8793
    STATS =
        chi2stat: 0.6742
              df: 3
           edges: [-3.7292 -0.9249 0.0099 0.9447 1.8795 2.8142 5.6186]
               O: [8 7 8 8 9 10]
               E: [8.7743 7.0639 8.7464 8.8284 7.2645 9.3226]

The test accepts the hypothesis that the data is normal. Notice, however, that something is different. Matlab grouped the data into $r = 6$ intervals, so the chi-squared test from the previous lecture should have $r - 1 = 6 - 1 = 5$ degrees of freedom, but we have df: 3! The difference is that now our hypothesis is not that the data comes from a particular given distribution but that the data comes from a family of distributions, which is called a composite hypothesis. Running

    [H,P,STATS] = chi2gof(X,'cdf',@(z)normcdf(z,mean(X),std(X,1)))

would test a simple hypothesis that the data comes from a particular normal distribution $N(\hat\mu, \hat\sigma^2)$, and the output

    H = 0, P = 0.9838
    STATS =
        chi2stat: 0.6842
              df: 5
           edges: [-3.7292 -0.9249 0.0099 0.9447 1.8795 2.8142 5.6186]
               O: [8 7 8 8 9 10]
               E: [8.6525 7.0995 8.8282 8.9127 7.3053 9.2017]

has df: 5. However, we cannot use this test because we estimate the parameters $\hat\mu$ and $\hat\sigma^2$ of this distribution using the data, so this is not a particular given distribution; in fact, this is the distribution that fits the data the best, so the statistic $T$ in Pearson's theorem will behave differently.

Let us start with a discrete case when a random variable takes a finite number of values $B_1, \ldots, B_r$ with probabilities

$$p_1 = P(X = B_1), \ldots, p_r = P(X = B_r).$$

We would like to test a hypothesis that this distribution comes from a family of distributions $\{P_\theta : \theta \in \Theta\}$. In other words, if we denote

$$p_j(\theta) = P_\theta(X = B_j),$$

we want to test

    $H_0$: $p_j = p_j(\theta)$ for all $j \le r$ for some $\theta \in \Theta$
    $H_1$: otherwise.

If we wanted to test $H_0$ for one particular fixed $\theta$ we could use the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta))^2}{n p_j(\theta)},$$

where $\nu_j$ is the number of observations equal to $B_j$, and use a simple chi-squared goodness-of-fit test. The situation now is more complicated because we want to test if $p_j = p_j(\theta)$, $j \le r$, at least for some $\theta \in \Theta$, which means that we have many candidates for $\theta$. One way to approach this problem is as follows.

(Step 1) Assuming that hypothesis $H_0$ holds, i.e. $P = P_\theta$ for some $\theta \in \Theta$, we can find an estimate $\theta^*$ of this unknown $\theta$, and then

(Step 2) try to test if, indeed, the distribution $P$ is equal to $P_{\theta^*}$ by using the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta^*))^2}{n p_j(\theta^*)}$$

in the chi-squared goodness-of-fit test.

This approach looks natural; the only question is what estimate $\theta^*$ to use and how the fact that $\theta^*$ also depends on the data will affect the convergence of $T$. It turns out that if we let $\theta^*$ be the maximum likelihood estimate, i.e. the $\theta$ that maximizes the likelihood function

$$\varphi(\theta) = p_1(\theta)^{\nu_1} \cdots p_r(\theta)^{\nu_r},$$

then the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta^*))^2}{n p_j(\theta^*)} \xrightarrow{d} \chi^2_{r-s-1} \tag{11.0.1}$$

converges to the $\chi^2_{r-s-1}$ distribution with $r - s - 1$ degrees of freedom, where $s$ is the dimension of the parameter set $\Theta$. Of course, here we assume that $s \le r - 2$ so that we have at least one degree of freedom. Very informally, by dimension we understand the number of free parameters that describe the set

$$\{(p_1(\theta), \ldots, p_r(\theta)) : \theta \in \Theta\}.$$

Then the decision rule will be

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

where the threshold $c$ is determined from the condition

$$P(\delta \ne H_0 \mid H_0) = P(T > c \mid H_0) \approx \chi^2_{r-s-1}(c, +\infty) = \alpha,$$

where $\alpha \in [0, 1]$ is the level of significance.

Example 1. Suppose that a gene has two possible alleles $A_1$ and $A_2$ and the combinations of these alleles define three genotypes $A_1A_1$, $A_1A_2$ and $A_2A_2$. We want to test a theory that

    Probability to pass $A_1$ to a child = $\theta$
    Probability to pass $A_2$ to a child = $1 - \theta$

and that the probabilities of genotypes are given by

$$p_1(\theta) = P(A_1A_1) = \theta^2, \quad p_2(\theta) = P(A_1A_2) = 2\theta(1-\theta), \quad p_3(\theta) = P(A_2A_2) = (1-\theta)^2. \tag{11.0.2}$$

Suppose that, given a random sample $X_1, \ldots, X_n$ from the population, the counts of each genotype are $\nu_1$, $\nu_2$ and $\nu_3$. To test the theory we want to test the hypothesis

    $H_0$: $p_1 = p_1(\theta)$, $p_2 = p_2(\theta)$, $p_3 = p_3(\theta)$ for some $\theta \in [0, 1]$
    $H_1$: otherwise.

First of all, the dimension of the parameter set is $s = 1$ since the distributions are determined by one parameter $\theta$. To find the MLE $\theta^*$ we have to maximize the likelihood function

$$p_1(\theta)^{\nu_1} p_2(\theta)^{\nu_2} p_3(\theta)^{\nu_3}$$

or, equivalently, maximize the log-likelihood

$$\log p_1(\theta)^{\nu_1} p_2(\theta)^{\nu_2} p_3(\theta)^{\nu_3} = \nu_1 \log p_1(\theta) + \nu_2 \log p_2(\theta) + \nu_3 \log p_3(\theta) = \nu_1 \log \theta^2 + \nu_2 \log 2\theta(1-\theta) + \nu_3 \log (1-\theta)^2.$$
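
The maximization of this log-likelihood can be written out step by step. The following is a worked sketch that uses only the log-likelihood above and the identity $\nu_1 + \nu_2 + \nu_3 = n$; the symbol $\ell(\theta)$ for the log-likelihood is introduced here for convenience and is not part of the original notes.

```latex
\begin{align*}
\ell(\theta) &= \nu_1 \log \theta^2 + \nu_2 \log 2\theta(1-\theta) + \nu_3 \log (1-\theta)^2 \\
             &= (2\nu_1 + \nu_2)\log\theta + (\nu_2 + 2\nu_3)\log(1-\theta) + \nu_2 \log 2, \\
\ell'(\theta) &= \frac{2\nu_1 + \nu_2}{\theta} - \frac{\nu_2 + 2\nu_3}{1-\theta} = 0
  \;\Longrightarrow\; (2\nu_1 + \nu_2)(1-\theta) = (\nu_2 + 2\nu_3)\,\theta \\
  &\Longrightarrow\; 2(\nu_1 + \nu_2 + \nu_3)\,\theta = 2\nu_1 + \nu_2
  \;\Longrightarrow\; \theta = \frac{2\nu_1 + \nu_2}{2n}.
\end{align*}
```

Since $\ell$ is a sum of concave functions of $\theta$, this critical point is the maximum. It is also the natural estimate: each person carries two alleles, so $2\nu_1 + \nu_2$ counts the copies of $A_1$ among the $2n$ alleles in the sample.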

If we compute the critical point by setting the derivative equal to 0, we get

$$\theta^* = \frac{2\nu_1 + \nu_2}{2n}.$$

Therefore, under the null hypothesis $H_0$ the statistic

$$T = \frac{(\nu_1 - n p_1(\theta^*))^2}{n p_1(\theta^*)} + \frac{(\nu_2 - n p_2(\theta^*))^2}{n p_2(\theta^*)} + \frac{(\nu_3 - n p_3(\theta^*))^2}{n p_3(\theta^*)} \xrightarrow{d} \chi^2_{r-s-1} = \chi^2_{3-1-1} = \chi^2_1$$

converges to the $\chi^2_1$ distribution with one degree of freedom. Therefore, in the decision rule

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

the threshold $c$ is determined by the condition

$$P(\delta \ne H_0 \mid H_0) \approx \chi^2_1(c, +\infty) = \alpha.$$

For example, if $\alpha = 0.05$ then $c = 3.841$.

Example 2. A blood type O, A, B, AB is determined by a combination of two alleles out of A, B, O, and allele O is dominated by A and B. Suppose that $p$, $q$ and $r = 1 - p - q$ are the population frequencies of alleles A, B and O correspondingly. If alleles are passed randomly from the parents then the probabilities of blood types will be

    Blood type   Allele combinations   Probabilities   Counts
    O            OO                    $r^2$           $\nu_1 = 121$
    A            AA, AO                $p^2 + 2pr$     $\nu_2 = 120$
    B            BB, BO                $q^2 + 2qr$     $\nu_3 = 79$
    AB           AB                    $2pq$           $\nu_4 = 33$

We would like to test this theory based on the counts of each blood type in a random sample of 353 people. We have four groups and two free parameters $p$ and $q$, so the chi-squared statistic $T$ under the null hypothesis will have the $\chi^2_{4-2-1} = \chi^2_1$ distribution with one degree of freedom. First, we have to find the MLE of the parameters $p$ and $q$. The log-likelihood is

$$\nu_1 \log r^2 + \nu_2 \log(p^2 + 2pr) + \nu_3 \log(q^2 + 2qr) + \nu_4 \log(2pq)$$
$$= 2\nu_1 \log(1 - p - q) + \nu_2 \log(2p - p^2 - 2pq) + \nu_3 \log(2q - q^2 - 2pq) + \nu_4 \log(2pq).$$

Unfortunately, if we set the derivatives with respect to $p$ and $q$ equal to zero, we get a system of two equations that is hard to solve explicitly. So instead we can minimize the negative log-likelihood numerically to get the MLE $\hat p = 0.247$ and $\hat q = 0.173$. Plugging these into the formulas for the blood type probabilities, we get the estimated probabilities and estimated counts in each group:

                   O          A          B          AB
    $\hat p_i$     0.3364     0.3475     0.2306     0.0855
    $n\hat p_i$    118.7492   122.6777   81.4050    30.1681
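
The numerical step in Example 2 can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the procedure used in the original notes: the helper names (`type_probs`, `neg_log_lik`, `compass_search`) are hypothetical, and the optimizer is a simple derivative-free compass search chosen only to keep the example self-contained. The p-value uses the fact that for one degree of freedom $P(\chi^2_1 > T) = \operatorname{erfc}(\sqrt{T/2})$.

```python
import math

# Observed blood-type counts from Example 2, in the order O, A, B, AB (n = 353).
COUNTS = [121, 120, 79, 33]


def type_probs(p, q):
    """Blood-type probabilities (O, A, B, AB) for allele frequencies p, q, r = 1 - p - q."""
    r = 1.0 - p - q
    return [r * r, p * p + 2 * p * r, q * q + 2 * q * r, 2 * p * q]


def neg_log_lik(p, q):
    """Negative log-likelihood; +inf outside the region p > 0, q > 0, p + q < 1."""
    if p <= 0.0 or q <= 0.0 or p + q >= 1.0:
        return math.inf
    return -sum(c * math.log(pr) for c, pr in zip(COUNTS, type_probs(p, q)))


def compass_search(f, start, step=0.1, tol=1e-7):
    """Hypothetical helper: try +/- step along each axis, halve the step when stuck."""
    x = list(start)
    fx = f(*x)
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = list(x)
                y[i] += delta
                fy = f(*y)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2.0
    return x


p_hat, q_hat = compass_search(neg_log_lik, [1 / 3, 1 / 3])
n = sum(COUNTS)
expected = [n * pr for pr in type_probs(p_hat, q_hat)]
T = sum((o - e) ** 2 / e for o, e in zip(COUNTS, expected))
# One degree of freedom: P(chi2_1 > T) = P(|Z| > sqrt(T)) = erfc(sqrt(T / 2)).
p_value = math.erfc(math.sqrt(T / 2.0))

print(f"p_hat = {p_hat:.3f}, q_hat = {q_hat:.3f}")
print("expected counts:", [round(e, 2) for e in expected])
print(f"T = {T:.2f}, p-value = {p_value:.3f}")
```

Any general-purpose optimizer would serve equally well here; the compass search is used only because it needs no derivatives and no external libraries.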

We can now compute the chi-squared statistic $T \approx 0.44$ and the p-value $\chi^2_1(T, \infty) = 0.5071$. The data agrees very well with the above theory.

We could also use a similar test when the distributions $P_\theta$, $\theta \in \Theta$, are not necessarily supported by a finite number of points $B_1, \ldots, B_r$, for example, continuous distributions. In this case if we want to test the hypothesis

    $H_0$: $P = P_\theta$ for some $\theta \in \Theta$

we can group the data into $r$ intervals $I_1, \ldots, I_r$ and test the hypothesis

    $H_0$: $p_j = p_j(\theta) = P_\theta(X \in I_j)$ for all $j \le r$ for some $\theta$.

For example, if we discretize the normal distribution by grouping the data into intervals $I_1, \ldots, I_r$ then the hypothesis will be

    $H_0$: $p_j = N(\mu, \sigma^2)(I_j)$ for all $j \le r$ for some $(\mu, \sigma^2)$.

There are two free parameters $\mu$ and $\sigma^2$ that describe all these probabilities, so in this case $s = 2$. The Matlab function chi2gof tests for normality by grouping the data and computing the statistic $T$ in (11.0.1), which is why it uses the $\chi^2_{r-s-1}$ distribution with $r - s - 1 = 6 - 2 - 1 = 3$ degrees of freedom and, thus, df: 3 in the example above.

Example. Let us test if the data normtemp from the normal body temperature dataset fits a normal distribution.

    [H,P,STATS] = chi2gof(normtemp)

gives

    H = 0, P = 0.0504
    STATS =
        chi2stat: 9.4682
              df: 4
           edges: [1x8 double]
               O: [13 12 29 27 35 10 4]
               E: [9.9068 16.9874 27.6222 31.1769 24.4270 13.2839 6.5958]

and we accept the null hypothesis at the default level of significance $\alpha = 0.05$ since the p-value $0.0504 > \alpha = 0.05$. We have $r = 7$ groups and, therefore, $r - s - 1 = 7 - 2 - 1 = 4$ degrees of freedom.

In the case when the distributions $P_\theta$ are continuous or, more generally, have an infinite number of values that must be grouped in order to use the chi-squared test (for example, the normal or Poisson distribution), it can be a difficult numerical problem to maximize the grouped likelihood function

$$P_\theta(I_1)^{\nu_1} \cdots P_\theta(I_r)^{\nu_r} \to \max_\theta.$$
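
For the normal family, maximizing the grouped likelihood can be sketched as follows in standard-library Python. This is a hedged illustration, not the computation performed by chi2gof: the function names are hypothetical, the first and last intervals are assumed to extend to $-\infty$ and $+\infty$, the optimizer is a simple compass search over $(\mu, \log\sigma)$, and the cut points and counts are the interior edges and observed counts from the first Matlab output in this section.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) at x, with explicit handling of +/- infinity."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def interval_probs(mu, sigma, cuts):
    """P(X in I_j) for I_1 = (-inf, cuts[0]], ..., I_r = (cuts[-1], +inf)."""
    bounds = [-math.inf] + list(cuts) + [math.inf]
    cdf = [normal_cdf(b, mu, sigma) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def grouped_neg_log_lik(mu, log_sigma, cuts, counts):
    """Minus the log of P(I_1)^nu_1 ... P(I_r)^nu_r, parametrized by (mu, log sigma)."""
    probs = interval_probs(mu, math.exp(log_sigma), cuts)
    if min(probs) <= 0.0:
        return math.inf
    return -sum(c * math.log(p) for c, p in zip(counts, probs))


def grouped_mle(cuts, counts, start=(0.0, 0.0), step=0.5, tol=1e-7):
    """Compass search for the grouped MLE; returns (mu*, sigma*)."""
    x = list(start)
    fx = grouped_neg_log_lik(x[0], x[1], cuts, counts)
    while step > tol:
        improved = False
        for i in range(2):
            for delta in (step, -step):
                y = list(x)
                y[i] += delta
                fy = grouped_neg_log_lik(y[0], y[1], cuts, counts)
                if fy < fx:
                    x, fx, improved = y, fy, True
        if not improved:
            step /= 2.0
    return x[0], math.exp(x[1])


# Interior edges and observed counts taken from the first chi2gof output above.
cuts = [-0.9249, 0.0099, 0.9447, 1.8795, 2.8142]
observed = [8, 7, 8, 8, 9, 10]

mu_star, sigma_star = grouped_mle(cuts, observed)
n = sum(observed)
expected = [n * p for p in interval_probs(mu_star, sigma_star, cuts)]
T = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"mu* = {mu_star:.4f}, sigma* = {sigma_star:.4f}, T = {T:.4f}")
```

Optimizing over $\log\sigma$ rather than $\sigma$ keeps the scale parameter positive without any explicit constraint.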

It is tempting to use the usual non-grouped MLE $\hat\theta$ of $\theta$ instead of the above $\theta^*$ because it is often easier to compute; in fact, for many distributions we know explicit formulas for these MLEs. However, if we use $\hat\theta$ in the statistic

$$T = \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)} \tag{11.0.3}$$

then it will no longer converge to the $\chi^2_{r-s-1}$ distribution. A famous result in [1] proves that typically this $T$ will converge to a distribution "in between" $\chi^2_{r-s-1}$ and $\chi^2_{r-1}$. Intuitively this is easy to understand because $\theta^*$ specifically fits the grouped data $\nu_1, \ldots, \nu_r$, so the expected counts

$$n p_1(\theta^*), \ldots, n p_r(\theta^*)$$

should be a better fit compared to the expected counts

$$n p_1(\hat\theta), \ldots, n p_r(\hat\theta).$$

On the other hand, these last expected counts should be a better fit than simply using the true expected counts

$$n p_1(\theta_0), \ldots, n p_r(\theta_0),$$

since the MLE $\hat\theta$ fits the data better than the true distribution. So typically we would expect

$$\sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta^*))^2}{n p_j(\theta^*)} \le \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\hat\theta))^2}{n p_j(\hat\theta)} \le \sum_{j=1}^{r} \frac{(\nu_j - n p_j(\theta_0))^2}{n p_j(\theta_0)}.$$

But the left hand side converges to $\chi^2_{r-s-1}$ and the right hand side converges to $\chi^2_{r-1}$. Thus, if the decision rule is based on the statistic (11.0.3):

$$\delta = \begin{cases} H_0 : & T \le c \\ H_1 : & T > c \end{cases}$$

then the threshold $c$ can be determined conservatively from the tail of the $\chi^2_{r-1}$ distribution, since

$$P(\delta \ne H_0 \mid H_0) = P(T > c) \le \chi^2_{r-1}(c, +\infty) = \alpha.$$

References:

[1] Chernoff, Herman; Lehmann, E. L. (1954). The use of maximum likelihood estimates in $\chi^2$ tests for goodness of fit. Ann. Math. Statistics 25, pp. 579-586.
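
The two candidate thresholds, from $\chi^2_{r-s-1}$ and from the conservative $\chi^2_{r-1}$, can be compared with a short standard-library Python sketch. The function names `chi2_sf` and `chi2_threshold` are hypothetical helpers written for this illustration; the tail probability uses the standard recurrence $Q_{k+2}(x) = Q_k(x) + (x/2)^{k/2} e^{-x/2} / \Gamma(k/2 + 1)$ for integer degrees of freedom, and the numbers $r = 7$, $s = 2$, $T = 9.4682$ are taken from the normtemp output earlier in this section.

```python
import math


def chi2_sf(x, df):
    """Upper tail P(chi2_df > x) for integer df >= 1.

    Starts from Q_1(x) = erfc(sqrt(x/2)) or Q_2(x) = exp(-x/2) and applies
    Q_{k+2}(x) = Q_k(x) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if x <= 0.0:
        return 1.0
    if df % 2 == 1:
        k, q = 1, math.erfc(math.sqrt(x / 2.0))
    else:
        k, q = 2, math.exp(-x / 2.0)
    while k < df:
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def chi2_threshold(alpha, df):
    """Bisection for the threshold c with P(chi2_df > c) = alpha, 0 < alpha < 1."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Values from the normtemp example: r = 7 groups, s = 2 parameters, T = 9.4682.
r, s, alpha, T = 7, 2, 0.05, 9.4682
c_grouped = chi2_threshold(alpha, r - s - 1)    # threshold from chi2_{r-s-1}
c_conservative = chi2_threshold(alpha, r - 1)   # threshold from chi2_{r-1}

print(f"c from chi2_{r - s - 1}: {c_grouped:.3f}, p-value {chi2_sf(T, r - s - 1):.4f}")
print(f"c from chi2_{r - 1}: {c_conservative:.3f}, p-value {chi2_sf(T, r - 1):.4f}")
```

The conservative threshold is always the larger of the two, so a test based on it rejects less often; that loss of power is the price of not knowing the exact limiting distribution of (11.0.3).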