Computer simulation on homogeneity testing for weighted data sets used in HEP


Petr Bouř and Václav Kůs

Department of Mathematics, Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Trojanova 13, 12000 Praha 2, Czech Republic

E-mail: petr.bour@fjfi.cvut.cz

Abstract. Modified statistical homogeneity tests on weighted data samples are commonly used in high energy physics applications. We typically apply the tests to check the homogeneity of weighted and unweighted samples, e.g. Monte Carlo simulations compared to real data measurements. The asymptotic approximation of the p-value of our weighted variants of homogeneity tests is investigated by means of simulation experiments. The simulation is performed for various probability sample distributions. We show that the asymptotic characteristics of the weighted homogeneity tests are valid for the specific distribution of weights.

1. Introduction

In high energy physics, homogeneity testing precedes many analysis and modeling techniques, particularly in machine learning (ML) applications. This is the case when we apply a data preprocessing procedure called data weighting. By assigning weights $w_1, \dots, w_n > 0$ to simulated observations $x_1, \dots, x_n$, $n \in \mathbb{N}$, we are able to fine-tune our Monte Carlo (MC) simulation data set with respect to our requirements. However, statistical theory concerning homogeneity tests does not handle any weighting procedures, nor does it associate weights with observations, except for sporadic works on weighted histograms, e.g. [1]. Therefore, the classical homogeneity tests must be adjusted for weighted data sets. Although weights can be incorporated into the classical homogeneity tests in a relatively straightforward way, the asymptotic properties of the modified tests can no longer be guaranteed. Thus, our goal is to investigate the validity of asymptotic properties of homogeneity testing for weighted observations through computer simulation.
The need for homogeneity testing typically arises from a simple signal/background binary classification task. In this common ML application, we often use MC simulation for both training and testing of the ML classifier. We may then apply the trained classifier to a real measured data set (DATA). Naturally, we expect both MC, with distribution $F$, and DATA, with distribution $G$, to be identically distributed: $F = G$. Otherwise, the classification model will not perform well.

2. Weighted tests of homogeneity

Let us assume it is vital to guarantee homogeneity of the DATA and MC distributions prior to the subsequent use of some ML methods. For this purpose, we first define an analogue of the empirical distribution function (EDF) for a weighted data set. Let $X = (X_1, \dots, X_n)$ be iid random variables with cumulative distribution function (CDF) $F(x)$ and let $(w_1, \dots, w_n)$ be their corresponding weights, where $W = \sum_{i=1}^{n} w_i$. We define the weighted empirical distribution function (WEDF) to be

$$F^W_n(x) = \frac{1}{W} \sum_{i=1}^{n} w_i \, I_{(-\infty, x]}(X_i), \tag{1}$$

where $I_A(X)$ is the indicator of the set $A$. Note that in the case of $w_i = 1$ for all $i = 1, \dots, n$ (i.e. unweighted DATA), the WEDF reduces to the usual EDF.

In order to avoid an investigation of an unknown parametric family, we pursue our homogeneity testing only with nonparametric approaches. Thus, in this section we present the Kolmogorov-Smirnov test based upon the EDFs of two data sets $X_1 = (X^{(1)}_1, \dots, X^{(1)}_{n_1})$, $X_2 = (X^{(2)}_1, \dots, X^{(2)}_{n_2})$, with respective distribution functions $F$, $G$. By the homogeneity hypothesis [3], our null hypothesis $H_0$, we understand

$$H_0: F = G \quad \text{vs} \quad H_1: F \neq G \quad \text{at significance level } \alpha \in (0, 1). \tag{2}$$

We require our homogeneity tests to meet the condition $P(W_C \mid H_0) \le \alpha$, where $W_C$ is the critical region for the specific test statistic $T$. We reject hypothesis $H_0$ if $T \in W_C$. The nature of homogeneity testing prompted us to look for the p-value, i.e. the lowest significance level $\alpha$ for which we reject hypothesis $H_0$. Thus, for every $\alpha >$ p-value we may automatically reject hypothesis $H_0$.

2.1. Two sample Kolmogorov-Smirnov test

Let $F_{n_1}$, $G_{n_2}$ denote the EDFs of two data samples $X_1$, $X_2$ with respective sample sizes $n_1$, $n_2$. We consider the test statistic

$$D_{n_1,n_2} = \sup_{x \in \mathbb{R}} \left| F_{n_1}(x) - G_{n_2}(x) \right|. \tag{3}$$

It is clear from the Glivenko-Cantelli lemma [2] that, under the true $H_0$, $D_{n_1,n_2} \to 0$ almost surely for $n_1, n_2 \to \infty$. Furthermore, due to [4], it follows that for the true $H_0$ and $\lambda > 0$

$$\lim_{n_1, n_2 \to \infty} P\left( \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \, D_{n_1,n_2} \le \lambda \right) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \lambda^2}. \tag{4}$$

Therefore, we obtain the approximate p-value as $2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \lambda_0^2}$, where $\lambda_0 = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \, D_{n_1,n_2}$ is evaluated at the observed value of the statistic.
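The statistic (3) and the approximate p-value derived from (4) can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' implementation: the function names, the truncation of the series at `terms` summands, and the shortcut that returns 1 for very small $\lambda_0$ (where the alternating series converges slowly) are assumptions made here.

```python
import numpy as np

def ks_statistic(x1, x2):
    """Two-sample statistic (3): sup-distance between the two EDFs."""
    x1 = np.sort(np.asarray(x1, dtype=float))
    x2 = np.sort(np.asarray(x2, dtype=float))
    # Both EDFs are right-continuous step functions, so the supremum
    # is attained at one of the pooled data points.
    grid = np.concatenate((x1, x2))
    f = np.searchsorted(x1, grid, side="right") / x1.size
    g = np.searchsorted(x2, grid, side="right") / x2.size
    return float(np.max(np.abs(f - g)))

def ks_asymptotic_pvalue(d, n1, n2, terms=100):
    """Approximate p-value 2 * sum_k (-1)^(k-1) exp(-2 k^2 lam0^2) from (4)."""
    lam0 = np.sqrt(n1 * n2 / (n1 + n2)) * d
    if lam0 < 0.2:
        # Assumption: the limit is numerically 1 here, and the truncated
        # alternating series would not have converged yet.
        return 1.0
    k = np.arange(1, terms + 1)
    series = np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * lam0**2))
    return float(np.clip(2.0 * series, 0.0, 1.0))
```

For example, `ks_asymptotic_pvalue(ks_statistic(a, b), len(a), len(b))` gives the approximate p-value for two unweighted samples `a` and `b`.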
However, for a weighted data sample we are forced to replace the EDFs $F_{n_1}$, $G_{n_2}$ and the numbers of entries $n_1$, $n_2$ with their respective WEDFs $F^{W_1}_{n_1}$, $G^{W_2}_{n_2}$ and the sums of weights $W_1$, $W_2$ in (3) and (4). Instead of (3), we thus obtain the test statistic

$$D_{W_1,W_2} = \sup_{x \in \mathbb{R}} \left| F^{W_1}_{n_1}(x) - G^{W_2}_{n_2}(x) \right|. \tag{5}$$

The definition (1) of the WEDF makes it clear that the statistic $D_{W_1,W_2} \to 0$ almost surely for $n_1, n_2 \to \infty$ and $W_1, W_2 \to \infty$. Nevertheless, it is important to notice some of the weaknesses inherent in the above approach. This modified test for the weighted data sample does not have to obey the asymptotic property (4). Let us emphasize that the p-value obtained using the statistic $D_{W_1,W_2}$ cannot be considered a regular approximate p-value without subsequent detailed research. At this point, we are not able to present a rigorous mathematical proof. However, we intend to supply the HEP community with some recommendations on the suitability of the weighted tests. This is why we propose a partial validation of our approach in section 3. A key insight we provide is the stable numerical verification performed for a few fundamental data distributions.
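As an illustration of the substitution described above, the following Python sketch evaluates the WEDF from (1), the weighted statistic (5), and the argument $\lambda_0$ in which the sums of weights $W_1$, $W_2$ take the place of $n_1$, $n_2$. The function names `wedf` and `weighted_ks` are hypothetical, and the code only shows one plausible reading of the formulas; the resulting $\lambda_0$ would then be plugged into the series in (4).

```python
import numpy as np

def wedf(points, weights, grid):
    """Weighted EDF (1) of (points, weights), evaluated at each grid value."""
    order = np.argsort(points)
    # cum[j] = total weight of the j smallest observations
    cum = np.concatenate(([0.0], np.cumsum(weights[order])))
    idx = np.searchsorted(points[order], grid, side="right")
    return cum[idx] / cum[-1]

def weighted_ks(x1, w1, x2, w2):
    """Return (D_{W1,W2}, lambda_0) for two weighted samples."""
    x1, w1, x2, w2 = (np.asarray(a, dtype=float) for a in (x1, w1, x2, w2))
    grid = np.concatenate((x1, x2))  # supremum is attained at a data point
    d = float(np.max(np.abs(wedf(x1, w1, grid) - wedf(x2, w2, grid))))
    s1, s2 = w1.sum(), w2.sum()      # sums of weights W1, W2 replace n1, n2
    lam0 = float(np.sqrt(s1 * s2 / (s1 + s2)) * d)
    return d, lam0
```

With unit weights the function reproduces the unweighted statistic, in line with the remark that the WEDF reduces to the EDF when all $w_i = 1$.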

3. Simulation experiments

We have already mentioned the problem of unsecured asymptotic properties when applying weighted modifications of the standard tests. We now turn our attention to the numerical simulation. For our purposes here, the best way is to validate the asymptotic properties using standard unweighted tests. This requires us to plug an unweighted data set into the testing (only in place of the weighted MC; DATA is unweighted already). We do this by an appropriate transformation of the weighted ensemble MC into an unweighted ensemble [5].

3.1. Rearranging technique

We make two requirements for the transformation. Firstly, we wish to preserve and exploit the information contained in the MC weighting, since the weight assigned to an observation expresses the extent to which the distribution should be present in the neighbourhood of this observation. Secondly, we require that the sum of weights in MC corresponds to the number of observations in the transformed (unweighted) MC.

We proceed as follows. Denote by $X = (X_{(1)}, \dots, X_{(n)})$ the ordered sample in MC with weights $(w_1, \dots, w_n)$ and let $W = \sum_{i=1}^{n} w_i$. Let $N = \lfloor W \rfloor$ denote the desired number of observations in the new transformed ensemble. Given both our requirements, we construct special weighted averages from $X$. For simplicity, we presume $0 \le w_i \le 1$ for all $i = 1, \dots, n$. Into the set of the first weighted average we include the smallest possible number of observations $(X_{(1)}, \dots, X_{(k_1)})$ such that $1 \le \sum_{i=1}^{k_1} w_i < 2$. Thereby, $\sum_{i=1}^{l} w_i < 1$ for all $l < k_1$. The portion of the weight $w_{k_1}$ of the observation $X_{(k_1)}$ by which the sum $\sum_{i=1}^{k_1} w_i$ exceeds 1 is not included in the first weighted average. We denote this residual portion by $r_{k_1} = \sum_{i=1}^{k_1} w_i - 1$. The first observation $Y_{(1)}$ in the transformed MC can then be defined as the weighted average

$$Y_{(1)} = \frac{\sum_{i=1}^{k_1} X_{(i)} w_i - X_{(k_1)} r_{k_1}}{\sum_{i=1}^{k_1} w_i - r_{k_1}}. \tag{6}$$

By the definition of $r_{k_1}$ the denominator in (6) equals 1, so we arrive at

$$Y_{(1)} = \sum_{i=1}^{k_1} X_{(i)} w_i - X_{(k_1)} r_{k_1} = \sum_{i=1}^{k_1 - 1} X_{(i)} w_i + X_{(k_1)} (w_{k_1} - r_{k_1}). \tag{7}$$

The residual portion $r_{k_1}$ is added to the next weighted average for $Y_{(2)}$. In general, for $Y_{(j)}$ we write

$$r_{k_j} = \sum_{i=k_{j-1}+1}^{k_j} w_i + r_{k_{j-1}} - 1, \tag{8}$$

$$Y_{(j)} = X_{(k_{j-1})} r_{k_{j-1}} + \sum_{i=k_{j-1}+1}^{k_j - 1} X_{(i)} w_i + X_{(k_j)} (w_{k_j} - r_{k_j}). \tag{9}$$

Repeating the same steps, we transform the original weighted ensemble MC with $X = (X_{(1)}, \dots, X_{(n)})$ into the new unweighted ensemble with $Y = (Y_{(1)}, \dots, Y_{(\tilde{n})})$. We have distributed the weights from the MC so that each observation $Y_{(j)}$ carries unit weight. Therefore, we may apply the standard homogeneity tests, for which the asymptotic properties are guaranteed.
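The rearranging steps in (6) to (9) can be written as a single pass over the ordered sample. The sketch below is an illustrative reading of those formulas, not the authors' code: the function name `rearrange` is hypothetical, it assumes $0 \le w_i \le 1$ as in the text, and it assumes that a final group whose accumulated weight stays below 1 is simply dropped.

```python
import numpy as np

def rearrange(x, w):
    """Turn a weighted sample (x, w) into unit-weight observations Y_(j)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)          # work on the ordered sample X_(1..n)
    y = []
    acc_w = 0.0                    # weight collected for the current Y_(j), < 1
    acc_xw = 0.0                   # matching sum of X_(i) * (weight used)
    for xi, wi in zip(x[order], w[order]):
        if acc_w + wi >= 1.0:
            r = acc_w + wi - 1.0   # residual r_{k_j}, cf. (8)
            # Close the group: only the portion (w - r) of this
            # observation contributes, cf. (7) and (9).
            y.append(acc_xw + xi * (wi - r))
            # The residual portion starts the next weighted average.
            acc_w, acc_xw = r, xi * r
        else:
            acc_w += wi
            acc_xw += xi * wi
    # Assumption: a trailing group with total weight below 1 is discarded.
    return np.array(y)
```

For instance, `rearrange([0, 1, 2], [0.75, 0.75, 0.5])` yields two unit-weight observations, since the total weight is 2: the first combines $X_{(1)}$ with a quarter unit of $X_{(2)}$, the second combines the residual half unit of $X_{(2)}$ with $X_{(3)}$.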

3.2. Generic p-value validation

We have verified the eligibility of the modified weighted tests in the previous section with data sets originating from high energy physics [5]. Now, we provide the reader with a more general verification. In our simulation, we consider several different distributions for $X = (X_1, \dots, X_n)$: Beta, Cauchy, Exponential, Laplace, Logistic, Lognormal, Normal, Uniform and Weibull, whilst the weights $W = (W_1, \dots, W_n)$ are taken from the Beta distribution, as we can easily tune the expected value,

$$W \sim \mathrm{Beta}(\alpha, \beta) \implies E[W] = \frac{\alpha}{\alpha + \beta}. \tag{10}$$

The appropriate number of data points was determined by preliminary convergence studies. Otherwise, the simulation steps proceed as follows:

(i) Generate $n$ random weighted data points $(X, W)$, e.g. $n = 3500000$.
(ii) Estimate the weighted distribution from all the observations $(X, W)$ using a kernel density estimator.
(iii) Repeat all the following items $k$ times, e.g. $k = 1000$:
    (a) Choose $m_w = n/k$ weighted observations from $(X, W)$ as the current MC sample, e.g. $m_w = 3500$.
    (b) Generate $m_u \approx \sum_{i=1}^{m_w} w_i$ unweighted observations from the estimated weighted distribution and consider them to be the current DATA sample, e.g. $m_u = 1000$.
    (c) Apply the weighted homogeneity test MC vs DATA.
    (d) Rearrange MC into an unweighted sample and apply the standard unweighted test.

Thus, we obtain $k$ individual p-values from the weighted tests and another $k$ corresponding p-values from the unweighted tests. We may now check the asymptotic properties of both weighted and unweighted tests. For all the distributions under consideration we arrived at two main results. First, the condition $P(W_C \mid H_0) \le \alpha$ for significance level $\alpha$ is uniformly satisfied, as shown in figure 1, i.e. both EDFs are located under the diagonal line in the graph. Second, both the weighted modifications and the unweighted tests have the same resulting p-value distribution. This was tested via common classical homogeneity tests for unweighted data.
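The simulation steps (i) to (iii) can be sketched end to end as follows. This is a scaled-down illustration under several assumptions that are not taken from the paper: Normal data, Beta(2, 5) weights drawn independently of the data, a Gaussian kernel with a Silverman-type bandwidth that is sampled as a smoothed bootstrap, consecutive blocks as the MC samples, and the hypothetical helper names `wedf`, `ks_pvalue`, `rearrange` and `simulate`. The unweighted test is obtained by calling the weighted one with unit weights.

```python
import numpy as np

def wedf(points, weights, grid):
    """Weighted EDF (1) evaluated on grid."""
    order = np.argsort(points)
    cum = np.concatenate(([0.0], np.cumsum(weights[order])))
    return cum[np.searchsorted(points[order], grid, side="right")] / cum[-1]

def ks_pvalue(x1, w1, x2, w2, terms=100):
    """Statistic (5) inserted into the series (4) with sums of weights."""
    grid = np.concatenate((x1, x2))
    d = np.max(np.abs(wedf(x1, w1, grid) - wedf(x2, w2, grid)))
    s1, s2 = w1.sum(), w2.sum()
    lam = np.sqrt(s1 * s2 / (s1 + s2)) * d
    if lam < 0.2:                  # assumption: series limit is numerically 1
        return 1.0
    k = np.arange(1, terms + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * lam**2))
    return float(np.clip(p, 0.0, 1.0))

def rearrange(x, w):
    """Rearranging technique of section 3.1 (weights assumed in [0, 1])."""
    order = np.argsort(x)
    y, acc_w, acc_xw = [], 0.0, 0.0
    for xi, wi in zip(x[order], w[order]):
        if acc_w + wi >= 1.0:
            r = acc_w + wi - 1.0
            y.append(acc_xw + xi * (wi - r))
            acc_w, acc_xw = r, xi * r
        else:
            acc_w, acc_xw = acc_w + wi, acc_xw + xi * wi
    return np.array(y)

def simulate(n=40_000, k=200, a=2.0, b=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)             # (i) data, Normal as an example
    w = rng.beta(a, b, size=n)             # (i) Beta-distributed weights
    h = 1.06 * x.std() * n ** (-0.2)       # (ii) KDE bandwidth (assumption)
    prob = w / w.sum()
    m_w = n // k
    p_weighted, p_unweighted = np.empty(k), np.empty(k)
    for j in range(k):                     # (iii)
        mc_x = x[j * m_w:(j + 1) * m_w]    # (a) current weighted MC sample
        mc_w = w[j * m_w:(j + 1) * m_w]
        m_u = int(round(mc_w.sum()))       # (b) DATA size ~ sum of weights
        data = rng.choice(x, size=m_u, p=prob) + h * rng.standard_normal(m_u)
        ones = np.ones(m_u)
        p_weighted[j] = ks_pvalue(mc_x, mc_w, data, ones)        # (c)
        y = rearrange(mc_x, mc_w)                                # (d)
        p_unweighted[j] = ks_pvalue(y, np.ones(y.size), data, ones)
    return p_weighted, p_unweighted
```

Given the two returned arrays, `np.mean(p_weighted <= alpha)` and `np.mean(p_unweighted <= alpha)` estimate the rejection rates that are compared with $\alpha$ in the text, which corresponds to checking that the EDF of the p-values lies under the diagonal.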
The extraordinary correspondence is already obvious from the graph in figure 1. Quite similar results were achieved for all other simulated distributions: Beta, Cauchy, Exponential, Laplace, Logistic, Normal, Uniform and Weibull.

Figure 1. EDF of p-value for weighted and unweighted tests of homogeneity. Homogeneity of p-value distributions was tested via common classical homogeneity tests for unweighted data. Underlying data are taken from the lognormal distribution.

4. Discussion

We carried out a numerical validation of the modified statistical homogeneity tests for generated data sets under Beta-weighting. Our simulation verifies that the approximate asymptotic properties remain the same for both weighted and unweighted tests. In practice, we may either use the modified weighted tests or apply the rearranging technique from section 3.1 directly with the unweighted standard tests (where the asymptotics are theoretically proven). However, our verification was performed for the specific choice of the weights distribution only. In future research, we aim to investigate the effect of various homogeneity tests and different weights distributions on the overall significance and power. We also plan to explore the possibility of a theoretical validation of some representative weighted tests for wide-ranging data distribution families. This may be achieved by appropriate restrictions imposed on the distribution of weights since, in practice, there exists only a limited number of high energy physics applications under rather specific weighting.

Please note that we only aimed to provide empirical recommendations for weighted tests of homogeneity. A reader may use these only for an experimental setup with data and weights distributions similar to ours. On the other hand, this may cover a large number of cases, as HEP data often come from these fundamental distributions or their mixtures. In any case, the reader may always use our validation procedures to verify whether the output of their test corresponds to a regular approximate p-value.

Acknowledgments

This work was supported by the grants LG15047 (MYES), LM2015068 (MYES), GA16-09848S (GACR), and SGS15/214/OHK4/3T/14 (CTU).

References

[1] Gagunashvili N D 2015 Chi-square goodness of fit tests for weighted histograms. Review and improvements JINST 10 P05004
[2] van der Vaart A W 1998 Asymptotic Statistics (Cambridge: Cambridge University Press) p 265
[3] DeGroot M H and Schervish M J 2010 Probability and Statistics (Boston: Pearson) p 647
[4] Smirnov N J 1944 Usp. Mat. Nauk. 10 179-206
[5] Bouř P 2016 Development of statistical nonparametric and divergence methods for data processing in D0 and NOvA experiments (Prague: FNSPE CTU) p 66