Multivariate Histogram Comparison

Similar documents
Irr. Statistical Methods in Experimental Physics. 2nd Edition. Frederick James. World Scientific. CERN, Switzerland

Confidence intervals for the parameter of Poisson. distribution in presence of background

Hypothesis testing:power, test statistic CMS:

Confidence distributions in statistical inference

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

PoS(ICRC2017)522. Testing the agreement between the X max distributions measured by the Pierre Auger and Telescope Array Observatories

Recall the Basics of Hypothesis Testing

Appendix F. Computational Statistics Toolbox. The Computational Statistics Toolbox can be downloaded from:

A new class of binning-free, multivariate goodness-of-fit tests: the energy tests

Monte Carlo Methods. Handbook of. University ofqueensland. Thomas Taimre. Zdravko I. Botev. Dirk P. Kroese. Universite de Montreal

The Monte Carlo method what and how?

Statistical techniques for data analysis in Cosmology

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Statistics Toolbox 6. Apply statistical algorithms and probability models

in NA64 at the CERN SPS. DSU-2018 June 2018 Annecy Search for a new X-boson and Dark Photons in NA64

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

POLI 443 Applied Political Research

Overview. DS GA 1002 Probability and Statistics for Data Science. Carlos Fernandez-Granda

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

STATISTICS ( CODE NO. 08 ) PAPER I PART - I

Statistic Distribution Models for Some Nonparametric Goodness-of-Fit Tests in Testing Composite Hypotheses

Karl-Rudolf Koch Introduction to Bayesian Statistics Second Edition

Statistical Methods for Calculating the Risk of Collision between Petroleum Wells Bjørn Erik Loeng and Erik Nyrnes, Statoil ASA

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

Performance Evaluation and Comparison

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

One-Sample Numerical Data

Systematic uncertainties in statistical data analysis for particle physics. DESY Seminar Hamburg, 31 March, 2009

Statistics for Particle Physics. Kyle Cranmer. New York University. Kyle Cranmer (NYU) CERN Academic Training, Feb 2-5, 2009

Distribution Fitting (Censored Data)

Bayesian Regression Linear and Logistic Regression

A DIAGNOSTIC FUNCTION TO EXAMINE CANDIDATE DISTRIBUTIONS TO MODEL UNIVARIATE DATA JOHN RICHARDS. B.S., Kansas State University, 2008 A REPORT

Signal Significance in the Presence of Systematic and Statistical Uncertainties

Machine Learning using Bayesian Approaches

Testing Homogeneity Of A Large Data Set By Bootstrapping

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1

MATH 240. Chapter 8 Outlines of Hypothesis Tests

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur

Modeling Hydrologic Chanae

Exploring Monte Carlo Methods

2D Image Processing (Extended) Kalman and particle filter

Subject CS1 Actuarial Statistics 1 Core Principles

Testing Statistical Hypotheses

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015

Some Statistical Tools for Particle Physics

Bivariate Rainfall and Runoff Analysis Using Entropy and Copula Theories

Numerical Analysis for Statisticians

Multivariate statistical methods and data mining in particle physics

Statistics for the LHC Lecture 2: Discovery

Bayesian Models in Machine Learning

Interpreting Regression Results

Notation Precedence Diagram

AST 418/518 Instrumentation and Statistics

arxiv:math/ v1 [math.pr] 9 Sep 2003

Large Scale Environment Partitioning in Mobile Robotics Recognition Tasks

SIMULATION SEMINAR SERIES INPUT PROBABILITY DISTRIBUTIONS

Goodness of Fit Tests for Rayleigh Distribution Based on Phi-Divergence

CSC 2541: Bayesian Methods for Machine Learning

Analysis of Statistical Algorithms. Comparison of Data Distributions in Physics Experiments

Statistical Methods for Particle Physics Tutorial on multivariate methods

Introduction to Probability and Statistics (Continued)

Search for the Higgs Boson In HWW and HZZ final states with CMS detector. Dmytro Kovalskyi (UCSB)

Application of Parametric Homogeneity of Variances Tests under Violation of Classical Assumption

IDL Advanced Math & Stats Module

Bootstrapping, Randomization, 2B-PLS

Practical Statistics

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Basic Computations in Statistical Inference

Direct Simulation Methods #2

Testing Statistical Hypotheses

Physics 509: Bootstrap and Robust Parameter Estimation

A MONTE CARLO INVESTIGATION INTO THE PROPERTIES OF A PROPOSED ROBUST ONE-SAMPLE TEST OF LOCATION. Bercedis Peterson

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

Experimental Design and Data Analysis for Biologists

Research Article The Laplace Likelihood Ratio Test for Heteroscedasticity

Curve Fitting Re-visited, Bishop1.2.5

Goodness-of-Fit Pitfalls and Power Example: Testing Consistency of Two Histograms

A comparison study of the nonparametric tests based on the empirical distributions

Robert Collins CSE586, PSU Intro to Sampling Methods

Spatial Point Pattern Analysis

Statistical Methods for Particle Physics Lecture 3: Systematics, nuisance parameters

Parametric fitting of data obtained from detectors with finite resolution and limited acceptance arxiv: v1 [physics.data-an] 2 Nov 2010

Introduction to statistics. Laurent Eyer Maria Suveges, Marie Heim-Vögtlin SNSF grant

Comparing Bayesian Networks and Structure Learning Algorithms

Data modelling Parameter estimation

APPLICATION AND POWER OF PARAMETRIC CRITERIA FOR TESTING THE HOMOGENEITY OF VARIANCES. PART IV

A note on vector-valued goodness-of-fit tests

Application of Homogeneity Tests: Problems and Solution

Basic Statistical Tools

Dr. Maddah ENMG 617 EM Statistics 10/15/12. Nonparametric Statistics (2) (Goodness of fit tests)

EE/CpE 345. Modeling and Simulation. Fall Class 10 November 18, 2002

15-388/688 - Practical Data Science: Basic probability. J. Zico Kolter Carnegie Mellon University Spring 2018

Application of Copulas as a New Geostatistical Tool

Bayesian Inference for the Multivariate Normal

Modified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

CMS Internal Note. The content of this note is intended for CMS internal use and distribution only

Visual interpretation with normal approximation

Transcription:

Multivariate Histogram Comparison S.I. BITYUKOV, V.V. SMIRNOVA (IHEP, Protvino), N.V. KRASNIKOV (INR RAS, Moscow) IML LHC Machine Learning WG Meeting May 17, 2016 CERN, Switzerland 1

Histogram Suppose, there is given a set of non-overlapping intervals. A histogram represents the frequency distribution of data which populates those intervals. This distribution is obtained during data processing of the sample (which is taken from the flow of events) with observed values of random variable). These intervals usually are called as bins. The filling: two extreme cases: a) one event - one histogram, b) one event one value is put to histogram. The filling of histogram in the case b is a chain of independent measurements with gradual filling of the histogram. We consider the multivariate approach which is proposed in the method of Statistical Comparison of Histograms (SCH) (arxiv:1302.2651, EPJ+ 128 (2013) 143). 2

Distance between histograms Given two histograms, how do we assess whether they are similar or not? What does it means "similar"? Several standard procedures exist for this task. Suppose, a reference histogram is known. Usually, the proximity of test histogram and reference histogram is measured via a test statistics, that provides the quantitative expression of the ``distance'' between histograms. The smaller the distance the more similar they are. There are several definitions of distance in the literature (Int. Journ. of economics and statistics, 4 (2016) 98), for example, the Kolmogorov distance, the Kullback-Leibner, the chi-square distance and so on. Usually, it is the some test statistics, distribution of which can be calculated via formulae or constructed by Monte Carlo. 3

Distinguishability or consistency Often a goal of histograms comparison is a testing of their consistency. Consistency here is the statement that both histograms are produced during data processing of independent samples which are taken from the same flow of events (or from the same population of events). The method of statistical comparison of histograms (SCH) allows to estimate the distinguishability of histograms and, correspondingly, the distinguishability of parent events flows (or parent samples). We use the distribution of some test statistics (significances of difference) instead of single test statistics in other methods. This distribution has statistical moments (the mean, RMS, asymmetry, excess,...), i.e. the distribution can be considered as multivariate test statistics with, for example, the mean and RMS as two coordinates and, in principle, any other univariate test statistics as additional coordinates. 4

Significance of the difference I 5

Significance of the difference II 6

Hypotheses testing I If the goal of the comparison of histograms is the check of their consistency, then task is reduced to hypotheses testing: main hypothesis H0 (histograms are produced during data processing of samples taken from the same flow of events) against alternative hypothesis H1 (histograms are produced during data processing of samples taken from different flows of events). The determination of critical area allows to estimate Type I error (α) and Type II error (β) in decision about choice between H0 and H1. The Type I error (α) is a probability of mistake if done choice is H1, but H0 is true. The Type II error (β) is a probability of mistake if done choice is H0, but H1 is true. 7

Hypotheses testing II The selection of a significance level (α) allows to estimate the power of the test (1-β). Usually, values of significance level are 10%, 5%, 1%. If both hypotheses are equivalent, then other combinations of the α and β are used. For example, in task about distinguishability of the flows of events works a relative uncertainty (α+β)/(2-(α+β)) for α+β 1. Under the test of equal tails the mean error (α+β)/2 can be used too. The hypotheses testing require the knowledge of the distribution of test statistics. As mentioned above the distribution of test statistics can be constructed by Monte Carlo. 8

Rehistogramming I If errors of values in bins of at least one of histogram are known (for example, reference histogram), than one can construct the set of similar histograms (clones), which imitates the population of histograms which produced due to data processing of the samples taken from the same flow of events. This set of histograms is used for construction of the distribution of test statistics for the case of H0 hypothesis (due to comparison of the reference histogram and the produced clones - some kind of calibration). This procedure can be named as "rehistogramming" in analogy with "resampling" in bootstrap technique. The correctness of this procedure follows, for example, from the properties of the statistically dual distributions (Applied Mathematics, 5 (2014) 963): normal with fixed variance, Poisson and Gamma, Cauchy, Laplace,. 9

Rehistogramming II Further the set of histograms of such type is constructed for test histogram (second histogram). New set is used for construction of the distribution of test statistics for the case of H1 hypothesis (due to comparison of the reference histogram and the produced clones of second histogram). The comparison of the distribution of test statistics for the case of H0 hypothesis and the distribution of test statistics for the case of H1 hypothesis allows to estimate the uncertainty in hypotheses testing. 10

Advantages of this approach 1. We have a measure of the distance between histograms. It is relative uncertainty of the decision about consistence or distinguishability of histograms. 2. We can compare multidimensional histograms likewise as unidimensional histograms. 3.We can compare two sets of several histograms simultaneously likewise as we compare a pair of histograms. 4.We can use any unidimensional test statistics (Kolmogorov-Smirnov, Anderson-Darling, ) as additional dimension in proposed multivariate test statistics. Let us consider the example of multivariate comparison. 11

Monte Carlo experiment Two pairs (reference pair and test pair) of independent flows of samples with realizations of random variables (each realization is ``event'') are produced to estimate the possibility of methods for distinguishing of samples from different information flows. The volume of each flow equals 5000 samples. First flow from each pair is a reference flow of samples with 1000 events (1000 realizations of random variable N(300,50) ). Second flow from first pair also is flow of samples with 2000 events (realizations of the same random variable) (Fig.A, left). Second flow from second pair is test flow of samples with 2000 of events (realizations of random variable N(300,44) ). The examples of distributions in samples are shown in Fig.A (right). 12

Examples of distributions in samples Fig.A Examples of distributions from two pairs of flows: left histogram - 1000 realizations of reference random variable N(300,50) and 2000 realizations of test random variable N(300,50) ), right histogram - 1000 realizations of reference random variable N(300,50) and 2000 realizations of test random variable N(300,44) ). 13

SCH approach In the case of Statistical Comparison of Histograms (SCH) method for each sample is constructed the histogram. After that for each pair of samples the comparison of histograms is performed with calculation of the mean value (<S>) of significances of the difference between corresponding bins of histograms and root mean square (rms) of the distribution of these ``significances''. The distribution of bidimensional values (<S>,rms) is constructed for each pair of samples from both pair of flows (see, Fig.B). After that the critical line (for equal tailed test) is calculated. Correspondingly, the values (Type I error) and (Type II error) are determined. The quality of the distinguishing is estimated as a relative uncertainty (α+β)/(2-(α+β)). 14

Hypotheses testing (SCH test) Fig.B Two distributions (<S>,rms) values for 5000 comparisons: the case of the comparison of samples from two similar flows (reference flow) and the case of the comparison of samples from reference flow and samples from test N(300,44) flow. 15

Conclusion The possibility of the use of the multivariate test statistics for the comparative analysis of histograms (or dependences) in frame of the statistical comparison method is considered. This method allows to use the multivariate test statistics for determination of the distance between histograms. This method allows to compare as multidimensional histograms and sets of histograms. In principle, the multivariate test-statistics can include any univariate test statistics as an additional coordinates. 16

Acknowledgements The authors are grateful to Prof. V. Kachanov, Prof. Yu. Korovin, Dr. S. Gleyzer, Dr. L. Moneta, Dr. N. Korneeva for helpful discussions. The work was supported by the Ministry of Education and Science RF (Agreement on October, 17, 2014 N 14.610.21.0004, id. PNIER RFMEFI61014X0004). 17

Backup slide 18

Statistical comparison of data sets Fig.C The comparison of the experimental data (triangles with errors and red spot) with results obtained in frame of two models (blue and green spots) [A. Maksimushkina, V. Smirnova, Method for data statistical visualization, Scientific Visualization, 7, Issue 5 (2015) 26-37]. 19