Exact Computation of Pearson Statistics Distribution and Some Experimental Results


AUSTRIAN JOURNAL OF STATISTICS
Volume 37 (2008), Number 1, 129-135

Exact Computation of Pearson Statistics Distribution and Some Experimental Results

Marina V. Filina and Andrew M. Zubkov
Steklov Mathematical Institute of RAS, Moscow, Russia

Abstract: Markov chain based algorithms for the exact and approximate computation of the distribution of the Pearson statistic for the multinomial scheme are described. Results of computational experiments reveal some new properties of the difference between this distribution and the corresponding chi-square distribution.

Keywords: Approximation of Probability Distributions.

1 Introduction

The Pearson statistic for the multinomial scheme and its modifications are used in many goodness-of-fit tests. As a rule, the selection of critical values for the Pearson statistic is based on the convergence of its distribution to the χ² distribution with the appropriate number of degrees of freedom as the sample size tends to infinity. In practice sample sizes are bounded, and the question of the accuracy of this approximation (especially in the distribution tails) arises naturally. Results on this problem were reported, in particular, in Holzman and Good (1986), where more than 25 examples of the equiprobable multinomial scheme with N ∈ [2, 16] outcomes and sample sizes T ∈ [1, 8] were considered. To compute the distribution function of the Pearson statistic, Holzman and Good (1986) used generating functions, and Good, Gover, and Mitchell (1970) the Fast Fourier Transform. A computational method for the distributions of decomposable statistics (also based on generating functions) was proposed in Selivanov (2006).

We propose to compute the distribution of the Pearson statistic by means of the Markov chain method suggested in Zubkov (1996, 2002); this method may also be applied to distributions of decomposable statistics for the multinomial and some other schemes.

2 Method

Let ν_1, ..., ν_N be the frequencies of N outcomes with probabilities p_1, ..., p_N in a multinomial sample of size T. Random variables of the form ζ = Σ_{j=1}^{N} f_j(ν_j), where f_1(x), ..., f_N(x) are given functions, are called decomposable statistics. In the case f_j(x) = (x − T p_j)²/(T p_j), j = 1, ..., N, we obtain the Pearson statistic

    X^2_{N,T} = \sum_{j=1}^{N} \frac{(\nu_j - T p_j)^2}{T p_j}.    (1)
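As a small illustration of the decomposable form (1), here is a minimal Python sketch (not the authors' code, which is in C++; the function and argument names are ours):

    def pearson_statistic(nu, p):
        """Pearson statistic X^2_{N,T} of (1), viewed as a decomposable statistic:
        a sum of terms f_j(nu_j) = (nu_j - T*p_j)^2 / (T*p_j)."""
        T = sum(nu)                      # sample size
        return sum((n - T * q) ** 2 / (T * q) for n, q in zip(nu, p))

    # Example: frequencies (12, 8, 10) under hypothetical probabilities (0.4, 0.3, 0.3).
    print(pearson_statistic([12, 8, 10], [0.4, 0.3, 0.3]))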

If the hypothetical probabilities p_j = m_j/n_j, j = 1, ..., N, are rational, then the formula for the Pearson statistic may be rewritten as

    X^2_{N,T} = \frac{1}{T \hat M \hat N} \sum_{j=1}^{N} \frac{\hat M}{m_j} \frac{\hat N}{n_j} (n_j \nu_j - m_j T)^2,    (2)

where M̂ = LCM(m_1, ..., m_N) and N̂ = LCM(n_1, ..., n_N). If all hypothetical probabilities are equal (p_1 = ... = p_N = 1/N), then the formula for the Pearson statistic may be represented in another form as

    X^2_{N,T} = \sum_{j=1}^{N} \frac{(\nu_j - T/N)^2}{T/N} = \frac{N}{T} \left( \sum_{j=1}^{N} \left( \nu_j - \left\langle \frac{T}{N} \right\rangle \right)^2 - N \left( \frac{T}{N} - \left\langle \frac{T}{N} \right\rangle \right)^2 \right),    (3)

where ⟨x⟩ = [x + 1/2] denotes the integer nearest to x. Formulas (2) and (3) reduce the computation of the distribution of the Pearson statistic to that of an integer-valued decomposable statistic. Exact distributions of integer-valued random variables may be stored as tables in computer memory.

Further, the conditional distribution of the frequency ν_t on the set {Σ_{j=1}^{t-1} ν_j = u} coincides with the binomial distribution Bin(T − u, p_t/P_t), where P_t := p_t + ... + p_N; so the sequence κ_0 = 0, κ_t = Σ_{j=1}^{t} ν_j, t = 1, ..., N, may be considered as a time-inhomogeneous Markov chain with state space {0, 1, ..., T} and transition probabilities

    p_t(v - u) = P\{\kappa_t = v \mid \kappa_{t-1} = u\} = C_{T-u}^{v-u} \left( \frac{p_t}{P_t} \right)^{v-u} \left( 1 - \frac{p_t}{P_t} \right)^{T-v}, \qquad 0 \le u \le v \le T.    (4)

So the sequences

    \zeta_0 = (0, 0), \quad \zeta_t = \left( \sum_{j=1}^{t} \nu_j, \; \sum_{j=1}^{t} \left( \nu_j - \left\langle \frac{T}{N} \right\rangle \right)^2 \right), \quad t = 1, ..., N,    (5)

    \zeta^*_0 = (0, 0), \quad \zeta^*_t = \left( \sum_{j=1}^{t} \nu_j, \; \sum_{j=1}^{t} \frac{\hat M}{m_j} \frac{\hat N}{n_j} (n_j \nu_j - m_j T)^2 \right), \quad t = 1, ..., N,    (6)

(being additive functions of {κ_t}) are finite nonhomogeneous Markov chains ζ_t = (ζ_{t,1}, ζ_{t,2}) and ζ*_t = (ζ*_{t,1}, ζ*_{t,2}) with transition probabilities

    P\left\{ \zeta_t = \left( v, \; s + \left( v - u - \left\langle \tfrac{T}{N} \right\rangle \right)^2 \right) \,\middle|\, \zeta_{t-1} = (u, s) \right\}
      = P\left\{ \zeta^*_t = \left( v, \; s + \frac{\hat M}{m_t} \frac{\hat N}{n_t} (n_t (v - u) - m_t T)^2 \right) \,\middle|\, \zeta^*_{t-1} = (u, s) \right\} = p_t(v - u)    (7)

for 0 ≤ u ≤ v ≤ T, and equal to 0 in the other cases.
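The recursion defined by (4)-(7) can be carried out directly. The following is a minimal, unoptimized Python sketch for the equiprobable case (the authors' implementation consists of several C++ programs; this reimplementation and the name exact_pearson_distribution are ours). It propagates the distribution of ζ_t over t = 1, ..., N and then converts ζ_{N,2} into values of X²_{N,T} via formula (3).

    from math import comb

    def exact_pearson_distribution(N, T):
        """Distribution of X^2_{N,T} for the equiprobable scheme p_1 = ... = p_N = 1/N,
        computed by the Markov-chain recursion (4)-(7); returns sorted (value, prob) pairs."""
        a = int(T / N + 0.5)                      # <T/N>, the integer nearest to T/N
        dist = {(0, 0): 1.0}                      # P{zeta_0 = (0, 0)} = 1
        for t in range(1, N + 1):
            q = 1.0 / (N - t + 1)                 # p_t / P_t for equal probabilities
            new = {}
            for (u, s), pr in dist.items():
                for v in range(u, T + 1):         # transition probabilities (4)
                    w = pr * comb(T - u, v - u) * q ** (v - u) * (1.0 - q) ** (T - v)
                    if w == 0.0:
                        continue
                    key = (v, s + (v - u - a) ** 2)   # increment of the second component, cf. (5) and (7)
                    new[key] = new.get(key, 0.0) + w
            dist = new
        # zeta_N = (T, zeta_{N,2}) a.s.; recover X^2 from zeta_{N,2} by formula (3)
        shift = N * (T / N - a) ** 2
        return sorted(((N / T) * (s - shift), p) for (_, s), p in dist.items())

A call such as exact_pearson_distribution(5, 20) runs in well under a second; this dictionary-based sketch becomes slow for the larger N and T reported below, for which the array-based C++ programs mentioned by the authors are needed.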

It is obvious that ζ_N = (T, ζ_{N,2}) a.s., and the distribution of (N/T)(ζ_{N,2} − N(T/N − ⟨T/N⟩)²) coincides with that of X²_{N,T} from (3); analogously, ζ*_N = (T, ζ*_{N,2}) a.s., and the distribution of ζ*_{N,2}/(T M̂ N̂) coincides with that of X²_{N,T} from (2). It follows that we may find the distribution of X²_{N,T} by means of a recursive computation of the distributions of ζ_k (or ζ*_k), k = 1, ..., N.

Note that the probabilities p_1, ..., p_N in the definition (4) of the transition probabilities of the chain κ_t need not be equal to the hypothetical probabilities in the formula for the Pearson statistic. In other words, this approach may also be applied to the computation of the distribution of the Pearson statistic under alternative hypotheses.

In the case of arbitrary probabilities p_1, ..., p_N we may use the same ideas to compute an approximate distribution function of X²_{N,T} by means of discretization. To this end it suffices to introduce the functions h_ε(x) = [1/2 + x/ε] (where ε > 0 and [y] denotes the integer part of y) and to consider the discrete Markov chain

    \zeta^\varepsilon_0 = (0, 0), \quad \zeta^\varepsilon_t = \left( \sum_{j=1}^{t} \nu_j, \; \sum_{j=1}^{t} h_\varepsilon\!\left( \frac{(\nu_j - T p_j)^2}{T p_j} \right) \right), \quad t = 1, ..., N,

with transition probabilities

    P\left\{ \zeta^\varepsilon_t = \left( v, \; s + h_\varepsilon\!\left( \frac{(v - u - T p_t)^2}{T p_t} \right) \right) \,\middle|\, \zeta^\varepsilon_{t-1} = (u, s) \right\} = p(v - u),

where p(v − u) may be defined by probabilities not necessarily coinciding with p_1, ..., p_N. Let, as earlier, ζ^ε_N = (T, ζ^ε_{N,2}). From the obvious estimate |ε h_ε(x) − x| ≤ ε/2 we have the crude bounds

    P\{\varepsilon \zeta^\varepsilon_{N,2} \le x - N\varepsilon/2\} \le P\{X^2_{N,T} \le x\} \le P\{\varepsilon \zeta^\varepsilon_{N,2} \le x + N\varepsilon/2\}.

Reducing the value of ε we may obtain arbitrarily good estimates for P{X²_{N,T} ≤ x}. The volume of memory used is inversely proportional to ε.

This method was realized by several C++ programs. In particular, for equiprobable schemes the distributions of the Pearson statistic were computed with the number of outcomes up to hundreds and the number of trials up to thousands. Computations on a PC take from seconds, if the numbers of outcomes and trials are less than 50, to minutes, if these numbers are of the order of several hundreds. Time and memory requirements of the program realizing the algorithm for rational probabilities depend heavily on their arithmetical structure. Time and memory requirements of the program for approximate computation are analogous to those of the program for the equiprobable case.
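A minimal Python sketch of this ε-discretization for arbitrary probabilities is given below (again an illustrative reimplementation rather than the authors' C++ programs; the function and parameter names are ours). It returns the two crude bounds on P{X²_{N,T} ≤ x} displayed above.

    from math import comb, floor

    def pearson_cdf_bounds(p, T, x, eps=0.01):
        """Lower and upper bounds for P{X^2_{N,T} <= x} via the eps-discretized
        chain zeta^eps; the hypothetical probabilities p may be arbitrary."""
        N = len(p)
        h = lambda y: floor(0.5 + y / eps)        # h_eps(y) = [1/2 + y/eps]
        tails = [sum(p[t:]) for t in range(N)]    # P_{t+1} = p_{t+1} + ... + p_N
        dist = {(0, 0): 1.0}
        for t in range(N):
            q = p[t] / tails[t]                   # success probability of Bin(T - u, p_t/P_t)
            new = {}
            for (u, s), pr in dist.items():
                for v in range(u, T + 1):
                    w = pr * comb(T - u, v - u) * q ** (v - u) * (1.0 - q) ** (T - v)
                    if w == 0.0:
                        continue
                    key = (v, s + h((v - u - T * p[t]) ** 2 / (T * p[t])))
                    new[key] = new.get(key, 0.0) + w
            dist = new
        lower = sum(pr for (_, s), pr in dist.items() if eps * s <= x - N * eps / 2)
        upper = sum(pr for (_, s), pr in dist.items() if eps * s <= x + N * eps / 2)
        return lower, upper

    # Example: bounds for P{X^2 <= 3.0} with N = 4 outcomes and T = 20 trials.
    print(pearson_cdf_bounds([0.1, 0.2, 0.3, 0.4], T=20, x=3.0, eps=0.05))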

3 Experimental Results

Computational experiments reveal some interesting features of the difference between the exact distributions of the Pearson statistic and the corresponding chi-square distributions. The equiprobable case with N = 10, T = 100 is used in Figure 1 to explain the structure of all subsequent pictures. The left part of Figure 1 shows the piecewise-constant distribution function of the exact Pearson statistic, the continuous distribution function of the χ² distribution with 9 degrees of freedom, the discontinuous saw-like difference between the two preceding functions, and the piecewise linear continuous function connecting the mean values of the difference at the discontinuity points (the "average difference"). In the following we consider plots of differences and average differences only (as in the right part of this figure).

Figure 1: N = 10, T = 100, p_1 = ... = p_10 = 0.1

In Figure 2 we plot differences and average differences for the equiprobable case with N = 10 outcomes and T = 100 and T = 1000 trials. Note that the shapes of the plots are almost independent of T. The ranges of the graphs are approximately inversely proportional to T. The average differences have the shape of a fading wave; the sign of the average difference becomes negative on the right tail of the distribution (after x ≈ 25), but eventually it becomes positive again (due to the boundedness of the distribution of the Pearson statistic).

Figure 2: Equiprobable cases, N = 10, T = 100 (left) and T = 1000 (right)
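Curves of the kind shown in Figures 1 and 2 can be recomputed from the exact distribution. A short Python sketch under the same assumptions as above (it reuses the hypothetical exact_pearson_distribution from Section 2 and assumes SciPy is available for the χ² distribution function):

    import numpy as np
    from scipy.stats import chi2

    # Difference and "average difference" between the exact distribution function of
    # X^2_{N,T} and the chi-square CDF with N - 1 degrees of freedom, as in Figure 1.
    # A smaller T is used here because the dictionary-based sketch above is unoptimized.
    N, T = 10, 30
    atoms = exact_pearson_distribution(N, T)                 # sorted (value, probability) pairs
    xs = np.array([x for x, _ in atoms])
    probs = np.array([p for _, p in atoms])
    F_exact = np.cumsum(probs)                               # exact CDF values at the atoms
    diff = F_exact - chi2.cdf(xs, df=N - 1)                  # saw-like difference at the jump points
    avg_diff = F_exact - probs / 2 - chi2.cdf(xs, df=N - 1)  # mean value of the difference at each jump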

Plots of the differences for equiprobable cases with constant values of the ratios T/N = 10 and T/N = 2 are shown in Figures 3 and 4, respectively. We can see that the shapes of the differences depend only slightly on N, that the shapes of the average differences are even more stable, and that the ranges of the graphs are approximately inversely proportional to the square root of the number of outcomes. So a large number of outcomes may compensate for an insufficient number of trials even when the quotient T/N is as small as 2 (Figure 4).

Figure 3: Equiprobable cases, T = 10N, N = 10 (left), N = 100 (right)

Figure 4: Equiprobable cases, T = 2N, N = 10 (left), N = 100 (right)

As the distribution on the set of outcomes becomes more non-uniform, the shape of the plots of the differences approaches the shape of the plots of the average differences, namely a fading wave with two minima and one maximum (see Figures 5 and 6). The values of these extrema are comparable with the extremum values of the average differences for the corresponding equiprobable cases. The reason why the differences shrink towards the average differences when the distribution of the outcomes moves away from the uniform one may be explained (on a heuristic level) as follows.

Figure 5: N = 10, T = 100, p = (0.05, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15) (left), p = (0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1, 0.15, 0.2, 0.25) (right)

Figure 6: N = 10, T = 100, p = (0.04, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.14, 0.14, 0.14) (left), p = (0.02, 0.04, 0.06, 0.08, 0.1, 0.1, 0.12, 0.14, 0.16, 0.18) (right)

According to (1) the support of the distribution of the Pearson statistic is contained in the set of nonnegative numbers having the representation Σ_{j=1}^{N} k_j²/(T p_j) − T, where k_1, ..., k_N are nonnegative integers. If the probabilities p_1 = c_1/d_1, ..., p_N = c_N/d_N > 0 are rational and M = LCM(c_1, ..., c_N), then this set is contained in the arithmetical progression {k/(T M), k = 0, 1, ...}.
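This lattice property can be checked by brute force on a toy case. A small Python sketch, purely illustrative (the probabilities and sizes are chosen by us):

    from fractions import Fraction
    from itertools import product
    from math import lcm

    # Verify on a tiny example that every value of X^2_{N,T} is a multiple of 1/(T*M),
    # where the p_j = c_j/d_j are rational and M = LCM(c_1, ..., c_N).
    p = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
    T = 5
    M = lcm(*(pj.numerator for pj in p))
    for nu in product(range(T + 1), repeat=len(p)):
        if sum(nu) != T:
            continue
        x2 = sum((n - T * pj) ** 2 / (T * pj) for n, pj in zip(nu, p))
        assert (x2 * T * M).denominator == 1      # x2 lies on the lattice {k/(T*M)}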

The distribution function F_{N,T}(x) = P{X²_{N,T} < x} is approximated by the distribution function of a χ² distribution with N − 1 degrees of freedom, and the smaller the value of T M, the greater the weights of the atoms of the distribution of X²_{N,T}, i.e. the jumps of the piecewise-constant distribution function F_{N,T}(x) (in particular, the largest atoms appear in the equiprobable case). Therefore, the accuracy of the approximation of F_{N,T}(x) by the distribution function of the χ² statistic with N − 1 degrees of freedom cannot be good for small values of T M. We hope to find a quantitative theoretical explanation of the effects described.

Acknowledgements

This research was partly supported by the Leading Scientific Schools Support Fund of the President of the Russian Federation, grant 4129.2006.1, and by the Program of RAS "Modern Problems of Theoretical Mathematics".

References

Good, I. J., Gover, T. N., and Mitchell, G. J. (1970). Exact distributions for χ² and for the likelihood-ratio statistic for the equiprobable multinomial distribution. Journal of the American Statistical Association, 65, 267-283.

Holzman, G. I., and Good, I. J. (1986). The Poisson and chi-squared approximation as compared with the true upper-tail probability of Pearson's χ² for equiprobable multinomials. Journal of Statistical Planning and Inference, 13, 283-295.

Selivanov, B. I. (2006). On the exact computation of distributions of decomposable statistics for the polynomial scheme (in Russian). Diskretnaya Matematika, 18, 85-94.

Zubkov, A. M. (1996). Recurrent formulae for distributions of functions of discrete random variables (in Russian). Obozr. Prikl. Prom. Matem., 3, 567-573.

Zubkov, A. M. (2002). Computational methods for distributions of sums of random variables (in Russian). In Trudy po Diskretnoi Matematike (Vol. 5, pp. 51-60). Moscow: Fizmatlit.

Corresponding Author's Address:

Andrew M. Zubkov
Department of Discrete Mathematics
Steklov Mathematical Institute
Gubkina Str. 8
119991 Moscow
Russia

E-mail: MFilina@mi.ras.ru and zubkov@mi.ras.ru