Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling


Paul Marriott 1, Radka Sabolova 2, Germain Van Bever 2, and Frank Critchley 2

1 University of Waterloo, Waterloo, Ontario, Canada
2 The Open University, Milton Keynes, UK

Abstract. We introduce a new approach to goodness-of-fit testing in the high dimensional, sparse extended multinomial context. The paper takes a computational information geometric approach, extending classical higher order asymptotic theory. We show why the Wald (equivalently, the Pearson $\chi^2$ or score) statistic is unworkable in this context, but that the deviance has a simple, accurate and tractable sampling distribution, even for moderate sample sizes. Issues of uniformity of asymptotic approximations across model space are discussed, and a variety of important applications and extensions are noted.

1 Introduction

A major contribution of classical information geometry to statistics is the geometric analysis of higher order asymptotic theory; see the seminal work [2] and, for example, [5]. It provides excellent tools for constructing higher order corrections to approximations of sampling distributions, an example being the work on the geometry of Edgeworth expansions in [2, Chapter 4]. These expansions use curvature terms to correct for skewness and other higher order moment (cumulant) effects, and provide good, operational corrections to sampling distributions, such as those in Fig. 3(b) and (c) below. However, as discussed in [6] and [3], these curvature terms grow without bound as the boundary of the probability simplex is approached. Since this region plays a key role in modelling in the sparse setting (the MLE often lying on the boundary), extensions to the classical theory are needed. This paper starts such a development.

Independently, there has been increased interest in categorical, (hierarchical) log-linear and graphical models; see, in particular, [10], [8], [7] and [6]. As stated by [7], "[their] statistical properties under sparse settings are still very poorly understood. As a result, [analysis of such data] remains exceptionally difficult." This paper is an introduction to a novel approach which combines and extends these two areas. The extension comes from using approximations based on the asymptotics of high dimensionality (k-asymptotics) rather than the more familiar sample size approach (N-asymptotics). This is connected to, but distinct from, the landmark paper [12] and related work; in particular, for a practical example of so-called sparse-data asymptotics, see [1, Section 6.3].

Computational information geometry, in all its forms (see, for example, [11], [13], [4] and [6]), has been a significant recent development, and this paper is a further contribution to it. We address the challenging problems which arise in the high dimensional, sparse extended multinomial context, where the dimension $k$ of the underlying probability simplex (one less than the number of categories, or cells) is much larger than the number of observations $N$, so that boundary effects necessarily occur; see [3]. In particular, arbitrarily small (possibly zero) expected cell frequencies must be accommodated. Hence, we work with extended multinomial models, which take us outside the manifold structure of classical information geometry [4]. For practical relevance, our primary focus is on: (a) accurate, finite sample and dimension approximations, rather than asymptotic limiting results per se; and (b) performance at or near the boundary, rather than (as in earlier studies) at the centre of the simplex.

Section 2.1 shows why the Wald statistic (identical, here, to the Pearson $\chi^2$ or score statistic) is unworkable in this context. In contrast, analysis and simulation exercises (Section 2.2) indicate that the same is not true of the deviance $D$. We demonstrate that a simple normal (or shifted $\chi^2$) approximation to the distribution of $D$ is accurate and tractable, even as the boundary is approached. In contrast to other approaches, this appears to hold effectively uniformly across the simplex. The worst case is at its centre (where all cells are equiprobable), due to discretisation effects. However, further theory shows that, even here, the accuracy of the approximation improves without limit as $N, k \to \infty$ with $N/k \to c > 0$. Section 3 considers the uniformity of asymptotic approximations; its three subsections address issues associated with the boundary, higher moments and discreteness, respectively.

2 Analysis

2.1 Why the Wald statistic is unworkable

With $i$ ranging over $\{0, 1, \ldots, k\}$, let $n = (n_i) \sim \mathrm{Multinomial}(N, (\pi_i))$, where here each $\pi_i > 0$. In this context the Wald, Pearson $\chi^2$ and score statistics all coincide, their common value $W$ being
$$W := \sum_{i=0}^{k} \frac{(\pi_i - n_i/N)^2}{\pi_i} = \frac{1}{N^2} \sum_{i=0}^{k} \frac{n_i^2}{\pi_i} - 1.$$
Defining $\pi^{(\alpha)} := \sum_{i=0}^{k} \pi_i^{-\alpha}$, we note the inequality, for each $m \ge 1$,
$$\pi^{(m)} - (k+1)^{m+1} \ge 0,$$
in which equality holds if and only if $\pi_i \equiv 1/(k+1)$, i.e. iff $(\pi_i)$ is uniform. We then have the following theorem, which establishes that the statistic $W$ is unworkable as $\pi_{\min} := \min(\pi_i) \to 0$ for fixed $k$ and $N$.

Theorem 1. For $k > 1$ and $N \ge 6$, the first three moments of $W$ are
$$E(W) = \frac{k}{N}, \qquad \mathrm{var}(W) = \frac{\{\pi^{(1)} - (k+1)^2\} + 2k(N-1)}{N^3},$$
and $E[\{W - E(W)\}^3]$ given by
$$\frac{\{\pi^{(2)} - (k+1)^3\} - (3k + 25 - 22N)\{\pi^{(1)} - (k+1)^2\} + g(k, N)}{N^5},$$
where $g(k, N) := 4(N-1)k(k + 2N - 5) > 0$. In particular, for fixed $k$ and $N$, as $\pi_{\min} \to 0$, $\mathrm{var}(W) \to \infty$ and $\gamma(W) \to +\infty$, where $\gamma(W) := E[\{W - E(W)\}^3]/\{\mathrm{var}(W)\}^{3/2}$.
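To make this concrete, the following sketch (in Python; ours, not the authors') draws samples of $W$ under nulls with ever smaller $\pi_{\min}$ and compares the Monte Carlo moments with the formulas of Theorem 1. The particular null, one tiny cell with the remaining mass spread evenly, is an illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def wald(counts, pi, N):
    # W = (1/N^2) * sum_i n_i^2 / pi_i - 1, as defined above
    return (counts**2 / pi).sum(axis=-1) / N**2 - 1

def theorem1(pi, N):
    # Exact mean, variance and skewness of W from Theorem 1
    k = pi.size - 1
    pi1, pi2 = (1 / pi).sum(), (1 / pi**2).sum()          # pi^(1), pi^(2)
    var = ((pi1 - (k + 1)**2) + 2 * k * (N - 1)) / N**3
    g = 4 * (N - 1) * k * (k + 2 * N - 5)
    mu3 = ((pi2 - (k + 1)**3)
           - (3 * k + 25 - 22 * N) * (pi1 - (k + 1)**2) + g) / N**5
    return k / N, var, mu3 / var**1.5

k, N = 200, 50
for eps in (1e-2, 1e-3, 1e-4):
    pi = np.full(k + 1, (1 - eps) / k)
    pi[0] = eps                                           # pi_min -> 0
    W = wald(rng.multinomial(N, pi, size=100_000), pi, N)
    m, v, s = theorem1(pi, N)
    print(f"pi_min={eps:.0e}  theory: mean={m:.2f} var={v:.3f} skew={s:.2f}  "
          f"MC: mean={W.mean():.2f} var={W.var():.3f} skew={stats.skew(W):.2f}")
```

The Monte Carlo variance and skewness estimates become increasingly erratic as $\pi_{\min}$ shrinks, being driven by rare, huge contributions from the smallest cell; that instability is precisely the practical sense in which $W$ is unworkable.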

2.2 The deviance statistic

Unlike the triumvirate of statistics above, the deviance has a workable distribution in the same limit: that is, for fixed $N$ and $k$, as we approach the boundary of the probability simplex. The paper [3] demonstrated the lack of uniformity, across this simplex, of standard first order $N$-asymptotic approximations. In sharp contrast, we will see very stable and workable behaviour of the $k$-asymptotic approximation to the distribution of the deviance. Define the deviance $D$ via
$$D/2 = \sum_{\{0 \le i \le k \,:\, n_i > 0\}} n_i \log(n_i/N) - \sum_{\{0 \le i \le k \,:\, n_i > 0\}} n_i \log(\pi_i) = \sum_{\{0 \le i \le k \,:\, n_i > 0\}} n_i \log(n_i/\mu_i),$$
where $\mu_i := E(n_i) = N\pi_i$. We will exploit the characterisation that the multinomial random vector $n$ has the same distribution as a vector of independent Poisson random variables conditioned on their sum. Specifically, let the elements of $(n_i^*)$ be independently distributed as $n_i^* \sim \mathrm{Po}(\mu_i)$. Then $N^* := \sum_{i=0}^{k} n_i^* \sim \mathrm{Po}(N)$, while $(n_i) := ((n_i^*) \mid N^* = N) \sim \mathrm{Multinomial}(N, (\pi_i))$. Define
$$S := \begin{pmatrix} N^* \\ D^*/2 \end{pmatrix} := \begin{pmatrix} \sum_{i=0}^{k} n_i^* \\ \sum_{i=0}^{k} n_i^* \log(n_i^*/\mu_i) \end{pmatrix},$$
where $D^*$ is defined implicitly and $0 \log 0 := 0$. The terms $\nu$, $\tau$ and $\rho$ are defined by the first two moments of $S$ via
$$E(S) = \begin{pmatrix} N \\ \nu \end{pmatrix} := \begin{pmatrix} \sum_{i=0}^{k} \mu_i \\ \sum_{i=0}^{k} E\{n_i^* \log(n_i^*/\mu_i)\} \end{pmatrix}$$

and
$$\mathrm{cov}(S) = \begin{pmatrix} N & \rho\tau\sqrt{N} \\ \rho\tau\sqrt{N} & \tau^2 \end{pmatrix} := \begin{pmatrix} \sum_{i=0}^{k} \mu_i & \sum_{i=0}^{k} C_i \\ \sum_{i=0}^{k} C_i & \sum_{i=0}^{k} V_i \end{pmatrix},$$
where $C_i := \mathrm{cov}(n_i^*,\, n_i^* \log(n_i^*/\mu_i))$ and $V_i := \mathrm{var}(n_i^* \log(n_i^*/\mu_i))$. Careful analysis gives:

Theorem 2. Each of the terms $\nu$, $\tau$ and $\rho$ is bounded as $\pi_{\min} \to 0$, and hence the distribution of the deviance is stable in this limit. Moreover, these terms can be easily and accurately approximated using standard truncate-and-bound computational methods, exploited below.

Under standard conditions, discussed in detail in Section 3, the multivariate central limit theorem (CLT) gives, for large $k$ but fixed $N$, that $S$ is approximately distributed as the bivariate normal $N_2(E(S), \mathrm{cov}(S))$. Standard normal theory then gives, in the same limit,
$$D/2 = (D^*/2 \mid N^* = N) \sim N_1(\nu,\, \tau^2(1 - \rho^2)). \qquad (1)$$

[Fig. 1. Stability of the sampling distributions. Panels: (a) the null distribution (cell probability against rank of cell probability); (b) a sample of the Wald statistic; (c) a sample of the deviance statistic.]
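Theorem 2 makes approximation (1) directly computable: $\nu$, $\tau^2$ and $\rho$ are sums of univariate Poisson expectations, each of which can be evaluated by truncating the Poisson series once its tail is negligible. The sketch below (ours; a plain truncation standing in for a full truncate-and-bound analysis, with hypothetical function names) computes these terms and checks approximation (1) against simulated deviances.

```python
import numpy as np
from scipy import stats

def poisson_terms(mu, tail=1e-12):
    # E{X log(X/mu)}, var{X log(X/mu)} and cov{X, X log(X/mu)} for X ~ Po(mu),
    # truncating the series where the remaining Poisson tail is below `tail`
    upper = int(stats.poisson.ppf(1 - tail, mu)) + 10
    x = np.arange(upper + 1)
    p = stats.poisson.pmf(x, mu)
    f = np.where(x > 0, x * np.log(np.maximum(x, 1) / mu), 0.0)
    Ef = (p * f).sum()
    return Ef, (p * f**2).sum() - Ef**2, (p * x * f).sum() - mu * Ef

def approx1(pi, N):
    # nu and tau^2 (1 - rho^2): the mean and variance in approximation (1)
    mu = N * np.asarray(pi)
    E, V, C = np.transpose([poisson_terms(m) for m in mu])
    nu, tau2, cov = E.sum(), V.sum(), C.sum()
    return nu, tau2 * (1 - cov**2 / (N * tau2))           # var(N*) = N

# check against simulation for a null close to the boundary, k = 200, N = 50
rng = np.random.default_rng(1)
k, N = 200, 50
pi = 0.95 ** np.arange(k + 1)
pi /= pi.sum()                                            # pi_min ~ 2e-6
counts = rng.multinomial(N, pi, size=10_000)
with np.errstate(divide="ignore", invalid="ignore"):
    halfD = np.where(counts > 0, counts * np.log(counts / (N * pi)), 0.0).sum(axis=1)
m, v = approx1(pi, N)
print(f"simulated D/2: mean {halfD.mean():.2f}, var {halfD.var():.2f}")
print(f"approx (1):    mean {m:.2f}, var {v:.2f}")
```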

3 Uniformity of asymptotic approximations

3.1 Uniformity near the boundary

In general, asymptotic approximations are not uniformly accurate, as shown in [3]. Consider the consequences of Theorem 1 when $\pi_{\min}$ is close to zero, as illustrated in Fig. 1. Panel (a) shows the null distribution $\pi$, for which $\pi_{\min}$ is indeed very small. Here, and throughout, we plot distributions in rank order; this is without loss of generality, since all sampling distributions considered are invariant to permutations of the labels of the multinomial cells. Panel (b) shows a sample of 1000 values of $W$ drawn from its distribution when there are $N = 50$ observations in dimension $k = 200$. The extreme non-normality, and hence the failure of the standard $N$-asymptotic approximation, is evident. In contrast, panel (c) shows 1000 replicates of $D$ for the same $(N, k)$ values; the much greater stability, implied by Approximation (1), is extremely clear.

The performance of Approximation (1) can, in fact, be improved by simple adjustments; we show a couple of examples in Fig. 2. Panel (a) shows a QQ-plot of the deviance against the normal in the case where the underlying distribution is that of Fig. 1(a), one very close to the boundary. We see that the normal approximation is good, but shows some skewness. Panel (b) shows a scaled $\chi^2$-approximation, designed to correct for skewness in the sampling distribution, while panel (c) shows a symmetrised version of the deviance statistic which, if used for testing against a two-tailed alternative, is a valid procedure. Both of these simple corrections show excellent performance.

[Fig. 2. Evaluation of the quality of k-asymptotic approximations. Panels: (a) normal approximation; (b) $\chi^2$-approximation; (c) symmetrised deviance (QQ-plots of normal or $\chi^2$ quantiles against deviance quantiles).]

Having seen that the $N$-asymptotic approximation does not hold uniformly across the simplex, it is natural to investigate the uniformity of the $k$-asymptotic approximation given by (1). This approximation exploited a bivariate normal approximation to the distribution of $S = (N^*, D^*/2)^T$, and it is sufficient to check the normal approximation to any linear function of $N^*$ and $D^*/2$. In particular, initially, we focus on the component $D^*$. We can express $D^*/2$ via the decomposition
$$D^*/2 = \sum_{\{0 \le i \le k \,:\, n_i^* > 0\}} n_i^* \log(n_i^*/\mu_i) = \Gamma + \Delta, \qquad (2)$$
where
$$\Gamma := -\sum_{i=0}^{k} \alpha_i n_i^*, \qquad \Delta := \sum_{\{0 \le i \le k \,:\, n_i^* > 1\}} n_i^* \log(n_i^*) \qquad \text{and} \qquad \alpha_i := \log \mu_i.$$
It is insightful to consider the terms $\Gamma$ and $\Delta$ separately.
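Decomposition (2) is an exact identity, realisation by realisation, under the Poisson characterisation, as the following check (ours, purely illustrative) confirms numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
k, N = 200, 50
pi = rng.dirichlet(np.ones(k + 1))
mu = N * pi

for _ in range(3):
    n = rng.poisson(mu)                                   # independent n_i* ~ Po(mu_i)
    pos, big = n > 0, n > 1
    half_D = (n[pos] * np.log(n[pos] / mu[pos])).sum()    # D*/2, by definition
    gamma = -(np.log(mu) * n).sum()                       # Gamma = -sum_i alpha_i n_i*
    delta = (n[big] * np.log(n[big])).sum()               # Delta, over cells with n_i* > 1
    print(f"D*/2 = {half_D:.10f}   Gamma + Delta = {gamma + delta:.10f}")
```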

3.2 Uniformity and higher moments

One way of assessing the quality of the $k$-asymptotic approximation for the distribution of $\Gamma$ is based on how well the moment generating function of the standardised $\Gamma$ is approximated by that of a standard normal. Writing this moment generating function as
$$M_\Gamma(t) = \exp\!\left(-\frac{E(\Gamma)\,t}{\sqrt{\mathrm{var}(\Gamma)}}\right) \exp\!\left(\sum_{h=1}^{\infty} \frac{(-1)^h \left\{\sum_{i=0}^{k} \mu_i (\log \mu_i)^h\right\}}{h!} \left(\frac{t}{\sqrt{\mathrm{var}(\Gamma)}}\right)^{h}\right),$$
then, when analysing where the approximation would break down, it is natural to make the third order (skewness) term, $\sum_{i=0}^{k} \mu_i (\log \mu_i)^3$, as large as possible for fixed mean $E(\Gamma) = -\sum_{i=0}^{k} \mu_i \log(\mu_i)$ and variance $\mathrm{var}(\Gamma) = \sum_{i=0}^{k} \mu_i (\log \mu_i)^2$. Solving this optimisation problem gives a distribution with three distinct values for $\mu_i$. An example is shown in Fig. 3, where $k = 200$. Panels (b) and (c) are histograms of samples of 1000 values of $W$ and $D$, respectively, drawn from their distributions when $N = 30$. In this example, both the Wald and deviance statistics are close to normal, but with significant skewness which disappears with a larger sample size.

[Fig. 3. Worst case solution for normality of Γ. Panels: (a) the null distribution (cell probability against rank of cell probability); (b) a sample of the Wald statistic; (c) a sample of the deviance statistic.]

This is to be expected from the analyses of [9] and [12], which study the behaviour of the deviance, when bounded away from the boundary of the simplex, as both $N$ and $k$ tend to infinity together. In particular, [9] shows that the accuracy of this normal approximation improves without limit when $N, k \to \infty$ with $N/k \to c > 0$.
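The exponent terms in $M_\Gamma(t)$ above give a computable diagnostic for where the normal approximation to $\Gamma$ is stressed: after standardisation, the $h = 2$ term is always $t^2/2$, and normality fails to the extent that the $h \ge 3$ terms are non-negligible. A sketch of this diagnostic (ours; the comparison nulls are arbitrary illustrative choices, not the solution of the optimisation problem above):

```python
import math
import numpy as np

def exponent_terms(pi, N, hmax=4):
    # ((-1)^h / h!) * sum_i mu_i (log mu_i)^h / var(Gamma)^(h/2), h = 2, 3, ...
    # These are the coefficients of t^h in the exponent of M_Gamma(t).
    mu = N * np.asarray(pi, dtype=float)
    a = np.log(mu)
    var_gamma = (mu * a**2).sum()
    return {h: (-1)**h / math.factorial(h) * (mu * a**h).sum() / var_gamma**(h / 2)
            for h in range(2, hmax + 1)}

k, N = 200, 30
uniform = np.full(k + 1, 1 / (k + 1))
spiky = np.empty(k + 1)                        # two-level null with a few tiny cells
spiky[:20], spiky[20:] = 1e-5, (1 - 20e-5) / (k - 19)
for name, p in (("uniform", uniform), ("spiky", spiky)):
    print(name, {h: round(t, 4) for h, t in exponent_terms(p, N).items()})
```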

3.3 Uniformity and discreteness

In fact, the hardest cases for the normal approximation (1) to the distribution of the deviance lie in the parts of the simplex complementary to the hardest cases for the Wald statistic. For $W$, the problems occur at the boundary, while for (1) the worst place is the centre of the simplex, i.e. the uniform distribution. The difficulties there are due not to large higher order moments, but to discreteness. For this analysis, consider again decomposition (2). Note that $\Gamma$ is completely degenerate here: all the $\mu_i$ are equal, so $\Gamma$ is a fixed multiple of $N^*$ and is constant given $N^* = N$. Further, there are never any contributions to the $\Delta$ term from cells for which $n_i^*$ is 0 or 1. For $k \gg N$ we would expect that, with high probability, $n_i^* \in \{0, 1\}$ for all $i$; hence, after conditioning on $N^* = N$, there is no variability in $D$: it has a completely degenerate (singular) distribution. In the general case, all the variability comes from cells where $n_i^* > 1$, and these events can have a very discrete distribution, so the approximation by the continuous normal must be poor.

We illustrate this granular behaviour in Fig. 4. Panel (a) shows the uniform distribution when $k = 200$; panel (b) displays 1000 realisations of $D$ when $N = 30$. The discreteness of the distribution is very clear here, and is also illustrated in panel (c), which shows a QQ-plot of the sample against a normal distribution. Note that any given quantile is not far from the normal, but the discreteness of the underlying distribution means that not all quantiles can be attained. This may, or may not, be a problem in a goodness-of-fit testing situation.

[Fig. 4. Behaviour at the centre of the simplex, N = 30. Panels: (a) the null distribution; (b) a sample of the deviance statistic; (c) QQ-plot of the standardised deviance against theoretical normal quantiles.]

Again following the analysis of [9], this behaviour disappears as $N$ grows relative to $k$. This is shown in Fig. 5, where $N$ is now 60, twice its value in Fig. 4. The marked drop in granularity in panel (b) between Figures 4 and 5 is due to the much greater variability in the maximum observed value of the $n_i$ as $N$ increases. Clearly, for the distribution of any discrete random variable to be well approximated by a continuous one, it must have a large number of support points, close together. The good news here is that, for the deviance, this condition appears also to be sufficient.
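The granularity at the centre can be seen directly: at the uniform null, given the total $N$, $D/2$ is a deterministic shift of $\Delta = \sum_{\{i : n_i > 1\}} n_i \log n_i$, which takes only a handful of distinct values when $k \gg N$. A small sketch (ours, illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
k = 200
pi = np.full(k + 1, 1 / (k + 1))               # the uniform null

for N in (30, 60):
    counts = rng.multinomial(N, pi, size=10_000)
    big = counts > 1
    # Delta = sum over cells with n_i > 1 of n_i log n_i; at the uniform null
    # D/2 = Delta - N log(N / (k+1)), so Delta carries all the variability
    delta = np.where(big, counts * np.log(np.maximum(counts, 1)), 0.0).sum(axis=1)
    print(f"N = {N}: {np.unique(np.round(delta, 8)).size} distinct values "
          f"of Delta in 10000 draws")
```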

[Fig. 5. Behaviour at the centre of the simplex, N = 60. Panels: (a) the null distribution; (b) a sample of the deviance statistic; (c) QQ-plot of the standardised deviance against theoretical normal quantiles.]

4 Discussion

Overall, we have seen that the deviance remains stable and eminently usable in the high-dimensional, sparse contexts of accelerating practical importance. Discreteness issues are rare, predictable and well understood, while simple modifications are available to deal with any higher moment concerns, such as skewness. When using the deviance, computational information geometry can be used to gain insight into the power of the implicit likelihood ratio test, exploiting the fact that $D$ is constant on high-dimensional affine subspaces in the mean parameterisation [6], while both its null and alternative approximating distributions depend only on a few low-order moments, inducing a pivotal foliation.

Acknowledgements: The authors acknowledge with gratitude the support of EPSRC grant EP/L010429/1.

References

1. A. Agresti. Categorical Data Analysis. Wiley, Hoboken, NJ, 2002.
2. S.-I. Amari. Differential-Geometrical Methods in Statistics. Springer-Verlag, 1985.
3. K. Anaya-Izquierdo, F. Critchley, and P. Marriott. When are first order asymptotics adequate? A diagnostic. Stat, 3:17–22, 2014.
4. K. Anaya-Izquierdo, F. Critchley, P. Marriott, and P. Vos. Computational information geometry: foundations. In Proceedings of GSI 2013, LNCS, 2013.
5. O.E. Barndorff-Nielsen and D.R. Cox. Inference and Asymptotics. Chapman & Hall, 1994.
6. F. Critchley and P. Marriott. Computational information geometry in statistics: theory and practice. Entropy, 16:2454–2471, 2014.
7. S.E. Fienberg and A. Rinaldo. Maximum likelihood estimation in log-linear models. Annals of Statistics, 40:996–1023, 2012.
8. C.J. Geyer. Likelihood inference in exponential families and directions of recession. Electronic Journal of Statistics, 3:259–289, 2009.
9. L. Holst. Asymptotic normality and efficiency for certain goodness-of-fit tests. Biometrika, 59:137–145, 1972.
10. S.L. Lauritzen. Graphical Models. Clarendon Press, Oxford, 1996.
11. M. Liu, B.C. Vemuri, S.-I. Amari, and F. Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:2407–2419, 2012.
12. C. Morris. Central limit theorems for multinomial sums. Annals of Statistics, 3:165–188, 1975.
13. F. Nielsen and R. Nock. Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE Signal Processing Letters, 21(10):1289–1292, 2014.