Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain

Similar documents
Time Series of Proportions: A Compositional Approach

The Goodness-of-fit Test for Gumbel Distribution: A Comparative Study

Chapter 31 Application of Nonparametric Goodness-of-Fit Tests for Composite Hypotheses in Case of Unknown Distributions of Statistics

Investigation of goodness-of-fit test statistic distributions by random censored samples

Goodness-of-fit tests for randomly censored Weibull distributions with estimated parameters

Bayesian trend analysis for daily rainfall series of Barcelona

Updating on the Kernel Density Estimation for Compositional Data

Principal balances.

CoDa-dendrogram: A new exploratory tool. 2 Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Spain;

The Mathematics of Compositional Analysis

Exact goodness-of-fit tests for censored data

The Dirichlet distribution with respect to the Aitchison measure on the simplex - a first approach

Distribution Fitting (Censored Data)

Modeling Extremal Events Is Not Easy: Why the Extreme Value Theorem Cannot Be As General As the Central Limit Theorem

Fitting the generalized Pareto distribution to data using maximum goodness-of-fit estimators

Exact goodness-of-fit tests for censored data

Finding Outliers in Monte Carlo Computations

Reducing Model Risk With Goodness-of-fit Victory Idowu London School of Economics

Application of statistical methods for the comparison of data distributions

A class of probability distributions for application to non-negative annual maxima

Recall the Basics of Hypothesis Testing

Statistic Distribution Models for Some Nonparametric Goodness-of-Fit Tests in Testing Composite Hypotheses

HANDBOOK OF APPLICABLE MATHEMATICS

The Metalog Distributions

Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede

A comparison study of the nonparametric tests based on the empirical distributions

Quadratic Statistics for the Goodness-of-Fit Test of the Inverse Gaussian Distribution

arxiv: v2 [stat.me] 16 Jun 2011

Test of fit for a Laplace distribution against heavier tailed alternatives

Journal of Biostatistics and Epidemiology

A GOODNESS-OF-FIT TEST FOR THE INVERSE GAUSSIAN DISTRIBUTION USING ITS INDEPENDENCE CHARACTERIZATION

Application of Homogeneity Tests: Problems and Solution

Journal of Environmental Statistics

An EM-Algorithm Based Method to Deal with Rounded Zeros in Compositional Data under Dirichlet Models. Rafiq Hijazi

LM threshold unit root tests

One-Sample Numerical Data

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

On the interpretation of differences between groups for compositional data

Package homtest. February 20, 2015

Research Article The Laplace Likelihood Ratio Test for Heteroscedasticity

A TEST OF FIT FOR THE GENERALIZED PARETO DISTRIBUTION BASED ON TRANSFORMS

Methodological Concepts for Source Apportionment

Declarative Statistics

Bayes spaces: use of improper priors and distances between densities

Spline Density Estimation and Inference with Model-Based Penalities

arxiv: v1 [stat.me] 2 Mar 2015

Extreme Value Theory as a Theoretical Background for Power Law Behavior

Some Theoretical Properties and Parameter Estimation for the Two-Sided Length Biased Inverse Gaussian Distribution

arxiv: v1 [stat.me] 22 Feb 2018

H 2 : otherwise. that is simply the proportion of the sample points below level x. For any fixed point x the law of large numbers gives that

Exact Statistical Inference in. Parametric Models

A copula goodness-of-t approach. conditional probability integral transform. Daniel Berg 1 Henrik Bakken 2

MATH 240. Chapter 8 Outlines of Hypothesis Tests

MONTE CARLO ANALYSIS OF CHANGE POINT ESTIMATORS

Lecture 2: CDF and EDF

GOODNESS-OF-FIT TEST FOR RANDOMLY CENSORED DATA BASED ON MAXIMUM CORRELATION. Ewa Strzalkowska-Kominiak and Aurea Grané (1)

A NOTE ON SOME RELIABILITY PROPERTIES OF EXTREME VALUE, GENERALIZED PARETO AND TRANSFORMED DISTRIBUTIONS

Statistical methods and data analysis

Exploring Compositional Data with the CoDa-Dendrogram

INDIAN INSTITUTE OF SCIENCE STOCHASTIC HYDROLOGY. Lecture -27 Course Instructor : Prof. P. P. MUJUMDAR Department of Civil Engg., IISc.

Testing Goodness-of-Fit for Exponential Distribution Based on Cumulative Residual Entropy

NEW CRITICAL VALUES FOR THE KOLMOGOROV-SMIRNOV STATISTIC TEST FOR MULTIPLE SAMPLES OF DIFFERENT SIZE

Compositional data analysis of element concentrations of simultaneous size-segregated PM measurements

MODEL FOR DISTRIBUTION OF WAGES

Goodness of Fit Tests for Rayleigh Distribution Based on Phi-Divergence

Multivariate Non-Normally Distributed Random Variables

Computer simulation on homogeneity testing for weighted data sets used in HEP

EXTREMAL QUANTILES OF MAXIMUMS FOR STATIONARY SEQUENCES WITH PSEUDO-STATIONARY TREND WITH APPLICATIONS IN ELECTRICITY CONSUMPTION ALEXANDR V.

On the relationship between objective and subjective inequality indices and the natural rate of subjective inequality

Overview of Extreme Value Theory. Dr. Sawsan Hilal space

REFERENCES AND FURTHER STUDIES

General Glivenko-Cantelli theorems

A comparison of inverse transform and composition methods of data simulation from the Lindley distribution

A goodness of fit test for the Pareto distribution

Regression with Compositional Response. Eva Fišerová

Notation Precedence Diagram

ASTIN Colloquium 1-4 October 2012, Mexico City

Two-sample scale rank procedures optimal for the generalized secant hyperbolic distribution

Joseph O. Marker Marker Actuarial a Services, LLC and University of Michigan CLRS 2010 Meeting. J. Marker, LSMWP, CLRS 1

A Note on Tail Behaviour of Distributions. the max domain of attraction of the Frechét / Weibull law under power normalization

Zipf's Law for Cities: On a New Testing Procedure 1

CIVL /8904 T R A F F I C F L O W T H E O R Y L E C T U R E - 8

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

An Architecture for a WWW Workload Generator. Paul Barford and Mark Crovella. Boston University. September 18, 1997

How Indecisiveness in Choice Behaviour affects the Magnitude of Parameter Estimates obtained in Discrete Choice Models. Abstract

Dr. Wolfgang Rolke University of Puerto Rico - Mayaguez

Bivariate Rainfall and Runoff Analysis Using Entropy and Copula Theories

Assessing Normality: Applications in Multi-Group Designs

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

LQ-Moments for Statistical Analysis of Extreme Events

Approaching predator-prey Lotka-Volterra equations by simplicial linear differential equations

Introduction to Algorithmic Trading Strategies Lecture 10

Abstract: In this short note, I comment on the research of Pisarenko et al. (2014) regarding the

A scale-free goodness-of-t statistic for the exponential distribution based on maximum correlations

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Department of Statistics, School of Mathematical Sciences, Ferdowsi University of Mashhad, Iran.

Explicit Bounds for the Distribution Function of the Sum of Dependent Normally Distributed Random Variables

THE CLOSURE PROBLEM: ONE HUNDRED YEARS OF DEBATE

A simple graphical method to explore tail-dependence in stock-return pairs

Transcription:

152/304 CoDaWork 2017 Abbadia San Salvatore (IT) Modified Kolmogorov-Smirnov Test of Goodness of Fit G.S. Monti 1, G. Mateu-Figueras 2, M. I. Ortego 3, V. Pawlowsky-Glahn 2 and J. J. Egozcue 3 1 Department of Economics, Management and Statistics, University of Milano-Bicocca, Italy gianna.monti@unimib.it 2 Department of Computer Science, Applied Mathematics, and Statistics, University of Girona, Spain 3 Department of Civil and Environmental Engineering, Technical University of Catalonia-BarcelonaTECH, Spain Abstract A modified version of the Kolmogorov-Smirnov (KS) test is presented as a tool to assess whether a specified, although arbitrary, probability model is unsuitable to describe the underlying distribution of a set of observations. The KS test computes distances between points of the sample cumulative distribution function and the hypothetical one as absolute differences between them, and then considering the supreme distance as test statistics. The modification here proposed consists of computing the mentioned distances as Aitchison distances of the probabilities as two part compositions. In this contribution, we investigate by simulation the asymptotic distribution of the proposed test statistic, checking the appropriateness of the Gumbel distribution. The properties of the asymptotic distribution are studied for samples coming from generic distributions such as uniform, normal, lognormal, gamma, beta and exponential with different values of the parameters. A brief Monte Carlo investigation is made of the type I error and power of the test. 1 Introduction The main purpose of this paper is to develop a goodness of fit test to assess the appropriateness of a certain theoretical distribution to the empirical one given a sample. We propose a modified version of the Kolmogorov-Smirnov test, which considers the largest absolute difference between two cumulative distribution functions (CDFs) as a dissimilarity. Section 2 presents the modified KS statistic which we propose in this paper. Section 3 deals with a Monte Carlo simulation study in order to investigate the asymptotic distribution of the proposed statistic and also to investigate the type I error and power of the test. Section 4 reports some comments on our proposal, which is just a first attempt to provide a log-ratio approach to a goodness of fit test, and suggests possible relationships between the sample size and the form of the test statistics. 2 The modified Kolmogorov-Smirnov statistic Consider an independent sample, denoted x = (x 1,..., x i,..., x n ), coming from a continuous random variable X. Let the hypothetical CDF be F (x θ), where θ are the parameters of F. We formulate the hypothesis H 0 : X F ( θ), against the alternative that the random variable does not follow the claimed distribution. H 0 can be tested using the well-known Kolmogorov-Smirnov (KS) statistic introduced by Kolmogorov (1933), which is a tool to assess whether a specified probability model is suitable to describe the underlying distribution of a set of observations. The expression of the KS statistic is D KS = sup F n (x) F (x), x R

CoDaWork 2017 Abbadia San Salvatore (IT) 153/304 where F n (x) = 1 n n 1{X i x}, i=1 is the empirical distribution function (EDF) of the sample and counts the proportion of the sample points less than or equal to x, and where 1{A} is the indicator of event A. In the context of tests of fit the Kolmogorov-Smirnov statistic can be formulated as follows. Suppose that F (x) is a continuous distribution, to be tested as the parent distribution of a given random sample X 1,..., X n. Let X (1), X (2),..., X (n) be the order statistics (i = 1,..., n), and consider the largest difference at the points where the EDF is greater than F (x) and the largest difference at the points where the EDF is smaller than F (x) as { i D + KS (i))} = max i=1,...,n n F (X, D KS = max i=1,...,n then, the Kolmogorov-Smirnov statistic is { F (X (i) ) (i 1) }, n D KS = max { D + KS, D KS}. (2) (1) The distribution of this statistic is known, even for finite samples (Birnbaum, 1952; Darling, 1957), and tables are available (Owen, 1962; D Agostino and Stephens, 1986). The modification of the KS test statistics in Equation (2) proposed here consists in replacing the absolute difference between the sample and the hypothetical CDF, which is a distance between real numbers, by a suitable difference for probabilities. Probabilities, like for instance i/n and F (x (i) ), can be considered as two part compositions, like for instance (i/n, 1 i/n) and (F (x (i) ), 1 F (x (i) )). In this case, a natural way of measuring the distance between probabilities is adopting the Aitchison distance (Aitchison, 1983; Aitchison et al., 2001). For 2-part compositions the Aitchison square distance between p 1 = (p 1, 1 p 1 ) and p 2 = (p 2, 1 p 2 ) is ( 1 d 2 p a(p 1, p 2 ) = 2 1 ln 1 ) 2 p 2 ln, 1 p 1 2 1 p 2 which is the square difference between the logit transforms of p 1 and p 2 up to the factors 1/ 2. Therefore, the Aitchison distance between two probabilities, d 2 a(p 1, p 2 ), can be identified with d 2 a(p 1, p 2 ). Under this perspective, we propose to consider { ( i )} D a + = max d a i=1,...,n 1 n, F (X (i)), D a and the modified KS statistic is = max i=2,...,n {d a ( F (X (i) ), (i 1) )}, n D a = max { D + a, D a }. (4) Note that the ranges of the index i in Equations (3) have been modified with respect to Equation (1), thus excluding infinite distances. In fact, a probability equal to 0 or equal to 1 is always at an infinite distance of other probabilities considered as compositions. An important property of D a as a test statistics is that it is invariant under a reversion of the orientation of the axis of the data. This means that the CDFs F (x θ), i/n, (i 1)/n can be substituted by 1 F (x θ), 1 i/n, 1 (i 1)/n respectively in Equation (3) and the value of D a does not change. This property is not fulfilled by the KS test statistics D KS. In order to complete a practical test, the distribution of the statistic in Equation (4) needs to be studied. However, the statistic (4) is the maximum of several distances. As a consequence, (3)

154/304 CoDaWork 2017 Abbadia San Salvatore (IT) the asymptotic distribution of D a is a generalized extreme value distribution (GEVD) (Embrechts et al., 1997). GEVD applies even in the case in which there is a weak dependence between the variables from which the maximum is computed. The appropriate type of GEVD is determined by the behaviour of the upper tail of the distances. In the present case, as the support of the distances is not bounded, the Weibull type of GEVD is excluded, and as the decay of the upper tails is exponential, the asymptotic distribution of the maximum is the Gumbel distribution (GEVD with ξ = 0) (Appendix A). Supported by a large number of Monte Carlo simulations (not shown here), we have observed that the D a statistic follows reasonably well a GEVD for maxima, that is, D A asymptotically follows a Gumbel distribution (Gumbel, 1954) and its parameters approximately depend on the sample size, as shown in Section 3. 3 Simulation results In order to investigate the parameters of the asymptotic distribution of the D a statistic, we have conducted a Monte Carlo study. The Monte Carlo (MC) procedure is as follows. A case consists of a single maximum likelihood estimation of the parameters of the Gumbel distribution. This case is obtained from m = 500 simulated samples coming from a given distribution with fixed sample size and parameters, i.e. we consider only the all-parameters-known case. For each case, the distribution model and the sample size are randomly selected. This is repeated for 1,000 different cases, thus obtaining 1,000 estimations of the scale and shape parameters of the Gumbel distribution fitted to D a. To obtain robust results, in these simulations a 5% trimmed D a statistic was used. The considered reference models were: normal and lognormal distribution with different mean and scale parameters, uniform distribution with several supports, exponential distribution with several rates, and gamma and beta distribution with great variety in the parameters. The sample sizes were randomly selected ranging from n = 5 up to n = 15, 000. Two regression models, linear and quadratic, of the 1,000 MC estimates of the Gumbel parameters, µ (location) and (scale), were estimated against the log-size of the sample. The results are displayed in Figure 1 and in Tables 1 and 2. Using the F-test to compare both models, the Gumbel's mu, sigma 0 1 2 3 4 2 3 4 5 6 7 8 log sample size Figure 1: MC results for the Gumbel parameters. Horizontal alignment for scale parameter. Tilted alignment for location parameter. Colours indicate different distribution models. Red and blue lines represent the estimated quadratic models.

CoDaWork 2017 Abbadia San Salvatore (IT) 155/304 quadratic appears to be better than the linear one. Table 1: Regression output for models (1) : µ = β 01 + β 11 ln(n) + ε and (2) : µ = β 01 + β 11 ln(n) + β 21(ln(n)) 2 + ε. Dependent variable: Coefficients (1) (2) β 1 0.159 0.729 (0.006) (0.019) µ β 2 0.057 (0.002) β 0 1.989 0.664 (0.027) (0.047) Observations 500 500 R 2 0.628 0.867 Adjusted R 2 0.627 0.866 Residual Std. Error 0.163 (df = 498) 0.098 (df = 497) F Statistic 840.711 (df = 1; 498) 1,619.445 (df = 2; 497) Note: p<0.1; p<0.05; p<0.01 Table 2: Regression output for models (1) : = β 02 + β 12 ln(n) + ε and (2) : = β 02 + β 12 ln(n) + β 22(ln(n)) 2 + ε. Dependent variable: Coefficients (1) (2) β 1 0.092 0.185 (0.001) (0.004) β 2 0.009 (0.0004) β 0 0.708 0.926 (0.005) (0.009) Observations 500 500 R 2 0.947 0.977 Adjusted R 2 0.947 0.977 Residual Std. Error 0.029 (df = 498) 0.019 (df = 497) F Statistic 8,947.071 (df = 1; 498) 10,484.430 (df = 2; 497) Note: p<0.1; p<0.05; p<0.01 A brief Monte Carlo investigation was made on the size (type I error) and on the power of the test. 5,000 samples of size n = 30, 50, 100 were drawn from each of several distributions. The probability of rejection using the modified Kolmogorov-Smirnov test (Tables 3 and 4) was determined. Results

156/304 CoDaWork 2017 Abbadia San Salvatore (IT) in Table 3 show a low conservative test in the sense that the actual significance level would be much greater than that given by the table, especially when the sample size is small. Table 3: Probability of rejecting the null hypothesis using D a (trim 5%) statistic with different sample sizes n. The numbers are the result of Monte Carlo simulations with 5,000 samples for each distribution. Underlying distribution Critical Level α n=30 n = 50 n = 100 N(0, 1) 0.05 0.0268 0.0242 0.0310 N(0, 1) 0.1 0.0644 0.0664 0.0964 Unif(1, 2) 0.05 0.0244 0.0232 0.0362 Unif(1, 2) 0.1 0.0706 0.0704 0.1024 Gamma(3, 5) 0.05 0.0236 0.0276 0.0392 Gamma(3, 5) 0.1 0.0674 0.0730 0.0996 Beta(2, 3) 0.05 0.0258 0.0272 0.0382 Beta(2, 3) 0.1 0.0686 0.0732 0.0974 Table 4: Probability of rejecting hypothesis of Standard Normal distribution using D a (trim 5%) statistic with different sample sizes n. The numbers are the result of Monte Carlo simulations with 5,000 samples for each distribution. Underlying distribution Critical Level α n=30 n = 50 n = 100 N(0, 4) 0.05 0.9232 0.9876 1.0000 0.1 0.9652 0.9962 1.0000 Student s t, 3 d.f. 0.05 0.4034 0.5360 0.8124 0.1 0.5274 0.6628 0.8992 Exponential, rate=1 0.05 0.6122 0.8002 0.9792 0.1 0.7276 0.8898 0.9922 Gamma, shape=rate=1 0.05 0.6160 0.8018 0.9788 0.1 0.7398 0.8908 0.9936 4 Discussion In this contribution we have proposed a modified version of the KS statistic. Although the test can be very useful in univariate statistics, the use in bivariate situations may be important, particularly to test goodness of fit for copulas. However, our proposal is just a first tentative to provide a logratio approach to a goodness of fit test, and to suggest possible relationships between the sample size and the form of the test statistics. Further studies are required to arrive at any definitive conclusions. Acknowledgements Research partially financially supported by the Italian Ministry of University and Research, FAR (Fondi di Ateneo per la Ricerca) 2015. The authors also gratefully acknowledge support by the Spanish Ministry of Education and Science under project CODA-RETOS (Ref. MTM2015-65016- C2-1 (2)-R (MINECO/FEDER,UE)) and by the Agència de Gestió d Ajuts Universitaris i de Recerca of the Generalitat de Catalunya under project COSDA (Ref. 2014SGR551).

CoDaWork 2017 Abbadia San Salvatore (IT) 157/304 REFERENCES Aitchison, J. (1983). Principal component analysis of compositional data. Biometrika 70 (1), 57 65. Aitchison, J., C. Barceló-Vidal, A. Martín-Fernández, and V. Pawlowsky-Glahn (2001). Reply to letter to the editor by S. Rehder and U. Zier on Logratio analysis and compositional distance. Mathematical Geology 33 (7), 849 860. Birnbaum, Z. W. (1952). Numerical tabulation of the distribution of kolmogorov s statistic for finite sample size. Journal of the American Statistical Association 47 (259), 425 441. Castillo, E. (1988). Extreme Value Theory in Engineering. Statistical Modeling and Decision Science. San Diego, Ca. (USA): Academic Press. Castillo, E., A. Hadi, N. Balakrishnan, and J. Sarabia (2004). Extreme value and related models with Applications in Engineering and Science. London, GB: Wiley. 384 p. D Agostino, R. and M. Stephens (1986). Goodness-of-fit Techniques. Statistics, textbooks and monographs. New York (USA): Marcel Dekker, INC. Darling, D. A. (1957). The kolmogorov-smirnov, cramer-von mises tests. The Annals of Mathematical Statistics 28 (4), 823 838. Embrechts, P., C. Kloppelberg, and T. Mikosch (1997). Modelling extremal values. Springer Verlag, Berlin. Gumbel, E. (1954). Statistical theory of extreme values and some practical applications, Volume 33. U.S. Department of Commerce, National Bureau of Standards, Applied Mathematics Series. (1st ed.). Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell Istituto Italiano degli Attuari 4, 83 91. Kotz, S. and S. Nadarajah (2000). Extreme value distributions. Theory and applications. London, GB: Imperial College Press. 185 p. Owen, D. (1962). A Handbook of Statistical Tables. Addison-Wesley, Reading, Mass. A The Gumbel Distribution The material in this appendix is well known and can be found in Embrechts et al. (1997) or in Castillo (1988), among others. The generalized extreme value distribution (GEVD) has the expression (Von Mises-Jenkinson formula; Embrechts et al. (1997); Castillo et al. (2004); Kotz and Nadarajah (2000)) [ ( ( z µ ) ) ] 1/ξ F Z (z µ,, ξ) = exp 1 + ξ, 1 + ξ (z µ) > 0, (5) where µ is a location parameter, is a scale parameter and ξ is a shape parameter. Parameters µ and ξ have support on the whole real line, and is positive. The values of the shape parameter ξ define the three families of asymptotic distribution: Weibull for ξ < 0, Fréchet for ξ > 0 and Gumbel in the limit case ξ = 0. In particular, if ξ = 0 (Gumbel distribution), the expression (5) has the limit form [ ( F Z (z µ,, ξ = 0) = exp exp z µ )], z R. (6)

158/304 CoDaWork 2017 Abbadia San Salvatore (IT) The corresponding probability density function is f Z (z µ,, ξ = 0) = 1 ( exp z µ ) ( ( exp exp z µ )). The mean and variance of the GEVD-Gumbel distribution are, respectively, E(Z) = µ + 2 γ, Var(Z) = π2 6 2, where γ is the Euler-Mascheroni constant, γ = 0 e t ln t dt. The inverse CDF sampling technique could be used to generate a random sample from a Gumbel distribution: if U Unif(0, 1) then Y = F 1 (U) = ln( ln U) has the standard Gumbel distribution (with µ = 0 and = 1). Given this result, the calculation of critical values of the test probability distribution is easy to compute.