Some Measures of Agreement Between Close Partitions

Similar documents
Estimation of the large covariance matrix with two-step monotone missing data

Introduction to Probability and Statistics

CHAPTER 5 STATISTICAL INFERENCE. 1.0 Hypothesis Testing. 2.0 Decision Errors. 3.0 How a Hypothesis is Tested. 4.0 Test for Goodness of Fit

4. Score normalization technical details We now discuss the technical details of the score normalization method.

7.2 Inference for comparing means of two populations where the samples are independent

A Comparison between Biased and Unbiased Estimators in Ordinary Least Squares Regression

AN OPTIMAL CONTROL CHART FOR NON-NORMAL PROCESSES

Hotelling s Two- Sample T 2

On split sample and randomized confidence intervals for binomial proportions

Lower Confidence Bound for Process-Yield Index S pk with Autocorrelated Process Data

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

Plotting the Wilson distribution

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

arxiv: v1 [physics.data-an] 26 Oct 2012

Supplementary Materials for Robust Estimation of the False Discovery Rate

AI*IA 2003 Fusion of Multiple Pattern Classifiers PART III

Slides Prepared by JOHN S. LOUCKS St. Edward s s University Thomson/South-Western. Slide

Solved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.

A Social Welfare Optimal Sequential Allocation Procedure

Uniform Law on the Unit Sphere of a Banach Space

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal

Tests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)

arxiv: v2 [stat.me] 3 Nov 2014

Research Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI **

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

State Estimation with ARMarkov Models

Maxisets for μ-thresholding rules

SIMULATION OF DIFFUSION PROCESSES IN LABYRINTHIC DOMAINS BY USING CELLULAR AUTOMATA

Robustness of multiple comparisons against variance heterogeneity Dijkstra, J.B.

Monte Carlo Studies. Monte Carlo Studies. Sampling Distribution

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Ratio Estimators in Simple Random Sampling Using Information on Auxiliary Attribute

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

MULTIVARIATE STATISTICAL PROCESS OF HOTELLING S T CONTROL CHARTS PROCEDURES WITH INDUSTRIAL APPLICATION

Evaluating Circuit Reliability Under Probabilistic Gate-Level Fault Models

The Binomial Approach for Probability of Detection

Ecological Resemblance. Ecological Resemblance. Modes of Analysis. - Outline - Welcome to Paradise

MATH 2710: NOTES FOR ANALYSIS

General Linear Model Introduction, Classes of Linear models and Estimation

Chapter 3. GMM: Selected Topics

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming

Background. GLM with clustered data. The problem. Solutions. A fixed effects approach

ASYMPTOTIC RESULTS OF A HIGH DIMENSIONAL MANOVA TEST AND POWER COMPARISON WHEN THE DIMENSION IS LARGE COMPARED TO THE SAMPLE SIZE

Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population

On Wald-Type Optimal Stopping for Brownian Motion

Lecture: Condorcet s Theorem

Distributed Rule-Based Inference in the Presence of Redundant Information

Metrics Performance Evaluation: Application to Face Recognition

Approximating min-max k-clustering

Finite Mixture EFA in Mplus

Published: 14 October 2013

Asymptotic Properties of the Markov Chain Model method of finding Markov chains Generators of..

Notes on Instrumental Variables Methods

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

Recursive Estimation of the Preisach Density function for a Smart Actuator

Numerical and experimental investigation on shot-peening induced deformation. Application to sheet metal forming.

One-way ANOVA Inference for one-way ANOVA

s v 0 q 0 v 1 q 1 v 2 (q 2) v 3 q 3 v 4

Elementary Analysis in Q p

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material

Bayesian inference & Markov chain Monte Carlo. Note 1: Many slides for this lecture were kindly provided by Paul Lewis and Mark Holder

Distributed K-means over Compressed Binary Data

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

Hypothesis Test-Confidence Interval connection

The Longest Run of Heads

A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

An Improved Calibration Method for a Chopped Pyrgeometer

Pairwise active appearance model and its application to echocardiography tracking

Multilayer Perceptron Neural Network (MLPs) For Analyzing the Properties of Jordan Oil Shale

Statics and dynamics: some elementary concepts

t 0 Xt sup X t p c p inf t 0

Sums of independent random variables

Deriving Indicator Direct and Cross Variograms from a Normal Scores Variogram Model (bigaus-full) David F. Machuca Mory and Clayton V.

Bayesian Spatially Varying Coefficient Models in the Presence of Collinearity

Linear diophantine equations for discrete tomography

An Investigation on the Numerical Ill-conditioning of Hybrid State Estimators

Feedback-error control

Estimation of Separable Representations in Psychophysical Experiments

DIFFERENTIAL evolution (DE) [3] has become a popular

2-D Analysis for Iterative Learning Controller for Discrete-Time Systems With Variable Initial Conditions Yong FANG 1, and Tommy W. S.

DISCRIMINANTS IN TOWERS

A Recursive Block Incomplete Factorization. Preconditioner for Adaptive Filtering Problem

Damage Identification from Power Spectrum Density Transmissibility

Sign Tests for Weak Principal Directions

Numerical Linear Algebra

Participation Factors. However, it does not give the influence of each state on the mode.

Estimating function analysis for a class of Tweedie regression models

AR PROCESSES AND SOURCES CAN BE RECONSTRUCTED FROM. Radu Balan, Alexander Jourjine, Justinian Rosca. Siemens Corporation Research

The Fekete Szegő theorem with splitting conditions: Part I

Sensitivity and Robustness of Quantum Spin-½ Rings to Parameter Uncertainty

An Analysis of Reliable Classifiers through ROC Isometrics

Applied Probability Trust (24 March 2004) Abstract

Uncorrelated Multilinear Discriminant Analysis with Regularization and Aggregation for Tensor Object Recognition

Scaling Multiple Point Statistics for Non-Stationary Geostatistical Modeling

Probability Estimates for Multi-class Classification by Pairwise Coupling

SIMULATED ANNEALING AND JOINT MANUFACTURING BATCH-SIZING. Ruhul SARKER. Xin YAO

substantial literature on emirical likelihood indicating that it is widely viewed as a desirable and natural aroach to statistical inference in a vari

CERIAS Tech Report The period of the Bell numbers modulo a prime by Peter Montgomery, Sangil Nahm, Samuel Wagstaff Jr Center for Education

Transcription:

Some Measures of Agreement Between Close Partitions Genane Youness and Gilbert Saorta CEDRIC CNAM, BP - 466, Beirut, Lebanon, genane99@hotmail.com Chaire de Statistique Aliquée- CEDRIC, CNAM, 9 rue Saint Martin, 75003 Paris, France, saorta@cnam.fr Abstract. In order to measure the similarity between two artitions coming from the same data set, we study extensions of the RV-coefficient, the aa coefficient roosed by Cohen (in case of artitions with same number of classes, and the D coefficient roosed by Poing. We find that the RV coefficient is identical to the Janson and Vegelius index. We comare the result coming from aa s coefficient to the ordination given by corresondence analysis. We study the emirical distribution of these indices under the hyotheses of a common artition. For this urose, we use data coming from a latent rofile model to formulate the null hyothesis. Keywords. RV, Janson and Vegelius index, Cohen s Kaa, Poing D, Corresondence analysis, latent class. Introduction The roblem of measuring the agreement between two artitions of a same data set has attracted interest and reaears continually in the literature of classification. The most useful agreement measured roosed by Rand (97 has been rediscovered and modified by [FOW 83]. Based on the comarison of object triles, a measure of artition corresondence was introduced by [HUB 85]. A generalized Rand index method for consensus clustering was roosed by [KRI 99] for finding an amalgamated clustering of a set of contributory artitions. By using mathematical and statistical asects in the standardization of the comarison coefficients, [LER 88] shows how to tae into account the relational constraint, which results from the artition structure. A simle way to comare artitions is to study the emirical distribution of a measure of agreement under some null hyothesis. It is necessary to define measures of agreement between artitions, before testing the agreement. In revious aers [SAP 0,0], [YOU 03], we have examined the following indices: Rand, Mc Nemar, and Jaccard. In this aer, we resent new aroaches to find the similarity between clustering. The RV-coefficient introduced by Robert, P. and Escoufier, Y. [ROB 76] is roosed and written in term of aired comarisons. We rove that RV-coefficient is identical to the JV index [JAN 8]. The aa coefficient roosed by Cohen [COH 60] is another way to measure the agreement between artitions with equal number of classes. Kaa can be used only in situations where categories are secified in advance, which is not the case of artitions where the labels of the classes are arbitrary: for that, we identify the ermutation of classes of one of the artitions by maximizing the aa values. We comare the otimal ermutation with the raning given by corresondence analysis: the results are the same. Finally we also study the D index roosed by Poing [POP 83]. This less nown index is based on comarison of airs of subjects. We find the emirical distribution of this agreement measure when the artitions only differ by chance.

. Notations Let P and P be two artitions (or two categorical variables of the same subjects with and q classes. If K and K are the disjunctives nx and nxq tables and N the corresonding contingency table with elements ' n, we have: N K K Each artition P is also characterized by the n n aired comarison table C with general term ' : c if i and i' are in the same class of P, ii' 0 otherwise Note that c ii, i N ' ' We have C KK and C K K Given n subjects, n(n-/ airs of subjects can be comared. When both artitions assign airs of subjects to the same classes, we consider this as an agreement. Table. The four cases P \ P Same class Different class Same class a Agreement (same Different class d Disagreement c Disagreement b Agreement (Different c ii 3. RV- coefficient Robert, P. and Escoufier, Y. [ROB 76] have derived a measure of similarity of two configurations, called RV-coefficient or the vector correlation coefficient; it allows to measure the similarity between two data tables X and X of the same observations by comaring the scalar roduct between individuals associated to the two tables. If the scalar matrices roduct X i X i is noted as W i with dimension nxn, the RV-coefficient is defined as: RV (X, X trace (W W trace (W trace (W When alied to the disjunctive tables associated to two artitions, we find: RV(P,P trace (C trace (C C trace (C (c ii ' (c ii ' i i' i i' (c ii ' ( c i i' ii ' q q ( A high value of RV may lead to the conclusions that artitions are almost identical. Lazraq, A. and Cleroux, R. [LAZ 0,0] have roosed a test for the null hyothesis that the theoretical (oulation RV is zero, but only in the case of numerical data, which cannot be alied to indicator variables.

3. JV index. The JV index is a measure of association roosed by Janson, S. and Vegelius, J. [JAN 8]. Its initial exression is: q nuv n u. q n.v + n u v u v JV(P,P [( n + n ][q(q n + n ] It has been roved that, in terms of aired comarisons: JV(P,P u u. (c (c ii' (c ii' i,i' i,i' Idrissi [IDR 00] used this formula to study the robability distribution of JV under the hyothesis of indeendence. If the classes are equirobable, one finds that c ii' c ii' follows a binomial distribution i i' B(n(n-,. Idrissi claims that the JV index between two categorical variables with equirobable modalities has an exected value equal to: ii' (c E (JV n The RV coefficient and the JV index are identical, in terms of aired comarisons. i,i' ii' q v q.v 4. Cohen s Kaa. The aa coefficient for comuting nominal scale agreement between two raters was first roosed by Cohen [COH 60]. He defined his index as the roortion of agreement after chance agreement is removed from consideration. P o Pe κ P Where P 0 n i n i n P e n n ii i.. i In this formula, P o is the amount of observed agreement. In case of indeendence, given the marginals, the exected amount of agreement is P e. Therefore, a correction is made for this amount agreement. In order that the index assumes the value in case of erfect agreement, the correction is also made in the denominator. It measures the deviation of objects on the diagonals in the agreement table. We use the equivalent formula of the aa coefficient to comute the agreement between two artitions having the same number of classes: e 3

n n n n ii i.. i i i κ n n i. n. i i An imortant condition to use the aa coefficient is for the situation in which the identity of classes er observations is nown in advance. In our case, when we comare artitions coming from the same data set and found by the clustering methods, the numbering of classes is totally arbitrary: so we roose to identify the classes of the artitions from the maximum value of κ. We ermute the columns or rows in the cross table and we calculate κ at each time; we tae the numbering of ermutation with the maximum κ. Another method to find the otimal ordering is to use the order of the categories of both variables given by the first factor in the simle corresonding analysis. This method maximizes the weight of the diagonal of the cross table by ermuting their lines and columns. We comare the result found by the simle corresonding analysis to our method. Often, the same value of κ can be observed. 5. Poing s D The D index was roosed by Poing, R. [POP 83] and is based on the same rinciles of aa coefficient. It is used to study the agreement between the nominal classifications by two judges who indeendently categorize the same subjects in case that the categories are not nown in advance. The D index is based on the comarison of airs of subjects and start from the situation of same agreement (Table. The index contains a correction for agreement that can be exected on the basis of chance given the marginals of the original classification. Under the null hyothesis of indeendence, the D is a transformation of the Russel and Rao s coefficient a (. a + b + c + d The D index is defined as: where Do De c q n i j (n n(n q c i j n(n D D D g (h 0,5g 0,5 where o m De D ni. (ni. n. j(n. j i D, D q n(n n(n e ni..n. j h and g integer(h n and D m Max(D,D q Note that in D one considers D e as a reasonable minimum [POP 83], so for situations where we use the g in D e, we have no emirical roof to get an exected value smaller than the minimum. This is in favor of g 4

instead of the fractional number h. However, this results in the fact that D e is biased: the results are too high. The consequence for D is that the index will tae conservative values. The D index has been comared to the aa coefficient to the JV- index, in the articular following case: Table. The articular case found by Poing Categories + - Total + u v-u v - v-u u v Total v v v u v Poing obtained the results: κ v u v D v JV In the general case we found that there are a strong correlation between these indices in the simulation study. We find a relation between the Jaccard s index nown as measure of similarity between objects described by resence-absence attributes [SAP 0] to the D coefficient, we find the following relation: In case of Integer(h h D n (C bj max(a + c,a + d C n a where J a + c + d and C n n(n 6. Emirical distribution We use the algorithm resented in [SAP 0] for finding the emirical distribution of the indices of association when the two artitions only differ by chance. The difficulty consists in concetualising a null hyothesis of identical artitions and a rocedure to chec it. Note that dearture from indeendence does not mean that there exists a strong enough agreement. Now we have to define the sentence two artitions are identical. Our aroach consists to say that the units come from a same artition, where the two observed artitions are noised. The latent class model is well adated to this roblem for getting artitions. Note that Green and Krieger [GRE 99] have used it in their consensus artition research. More recisely, we use the latent rofiles models because we have numerical variables For getting near-identical artitions, we suose the existence of a common artition for the oulation according to a latent rofile model. The basic hyothesis is the indeendency of observed variables conditional to the latent classes that gives similar artitions from one or another grous of variables: f ( x π f ( x / j j 5

Theπ are the class roortions and x is the random vector of observed variables, where the comonent x j are indeendent in each class. We use here the latent class model to generate the data and not to estimate arameters. Once having choosen the number of classes, we first generate their frequencies from a multinomial distribution with robabilities π, and then we generate observations in each class according to the local indeendence model in other words a normal mixture model with indeendent comonents in each class. Then we slit arbitrarily the variables into two sets and erform a artitioning algorithm on each set. The two artitions should thus differ only by random. We calculate the indices for the two artitions: we reeat the rocedure N times to find a samling distribution of aa, RV (or J index, and D. Our algorithm has the following stes:. Generate the sizes n, n,.., n of the clusters according to a multinomial distribution M(n; π π. For each cluster, generate n i values from a random normal vector with indeendent comonents 3. Get two artitions: P of the units according to the first variables and P according to the last - variables. 4. Comute association measures (for RV, J and D for P and P 4. Permute the columns of the cross table (P, P to find the maximum value of aa, so the numbering of classes. 5. Reeat the rocedure N times. 7. Numerical Alications We alied the revious rocedure with 4 equirobable latent classes, 000 units and 4 variables. We obtained the two artitions P (with X and X and P (with X 3 and X 4 with 4 classes by the -means methods. We resent only one of our simulations (erformed with S+ software. Table 3. The normal mixture model Class n Class n Class n 3 Class n 4 X N(.,.5 X N( -,.5 X N( -5,.5 X N(-8,.5 X N(4,.5 X N(-4,.5 X N(-0,.5 X N(-5,.5 X3 N(7,3.5 X3 N(-6,3.5 X3 N(-3,3.5 X3 N(-0,3.5 X4 N(0,4.5 X4 N(-0,4.5 X4 N(-0,4.5 X4 N(-30,4.5 The following figure shows the satial distribution of one of the 000 iterations for the two artitions P and P. Comonent -6-4 - 0 4 6 Comonent -5-0 -5 0 5 0 5-0 0 0 Comonent These two comonents exlain 00 % of the oint variability. -0 0 0 40 Comonent These two comonents exlain 00 % of the oint variability. Figure. The first two rincial comonents of one of the 000 samles for the P and P 6

The cross table of the two artitions in one of the simulations is reresented as follows: Table 4. Cross tabulation of P and P for one of the simulation 3 4 48 0 0 98 7 9 6 43 0 0 58 9 We find a value of the aa coefficient equal to 0.335, and a value of JV index (or RV equal to 0.648. The value of the D index is equal to 0.647. To identify the labels of classes of P to those of P, we ermute the columns (4! ermutations of the cross table: The maximum value of the aa coefficient is 0.787, and the numbering of column s ermutation is,,4,3. So the table becomes: Table 5. Cross tabulation with the new ermutation of columns 4 3 48 0 0 98 9 7 6 0 43 0 58 9 Another way of getting the otimal ordering is by means of corresondence analysis: categories of both variables are ordered according to their coordinates along the first axis. Table 6. Order according to the first factor of CA 98 7 9 58 9 0 6 43 0 0 0 48 If we calculate the aa coefficient of this last table, we find the value 0.787. Here corresondence analysis ordering gives the same result as our method. The distribution of maximum aa coefficient values is resented below: 7

400 300 00 00 0 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 KAPPA 0.3 0.4 0.5 0.6 0.7 0.40 0.45 0.50 0.55 0.60 0.65 0.70 Figure 3: Distribution of aa for 000 individuals and 000 iterations, artitions with 4 classes. The scatter lot of JV against D in the 000 iterations. With the same choice of normal indeendent mixtures variables, we find that the aa coefficient varies between 0.4 and 0.875. Its means is 0.8. By simulation, the value of the JV index varies between 0.4 and 0.7. The most frequent value is 0.63 and the mean is equal to 0.67. (Fig. 4 Under the hyothesis of indeendence E(JV0.003, and with 000 observations, indeendence should have been rejected for JV>0.67 at 5% level. The 5% critical value is much higher than the corresonding one in the indeendence case. It shows that dearture from indeendence does not mean that the two artitions are close enough. The D index taes its values between 0.3 and 0.7 with a mode equal to 0.65. The distribution has a mean equal to 0.60. Bimodality is due to the use of -means, which gives a local otima [YOU 04]. There is a strong correlation between the JV and D equal to 0.983. So we can say that the two indices give same result in comaring artitions (Fig.3 Under the null hyothesis of close artitions, all indices have values around the mean, which is close to 0.6 so that we could say that the two artitions P and P are close enough. 8

0 50 00 50 00 50 300 0 00 00 300 400 500 600 0.4 0.5 0.6 0.7 JV 0.3 0.4 0.5 0.6 0.7 D Figure. The JV- index distribution and D for the artitions of 4 classes in 000 iterations 8. Conclusion In this aer, we have roved the identity between the RV coefficient and the JV- index for comaring two artitions. A latent class model has been used to solve the roblem of comaring close artitions and three agreement indices: JV (or RV, aa and D have been studied. The aa coefficient allows to test the similarity between two artitions after ermutation and when we they have the same number of classes. We comare the otimal ermutation with the order given by corresondence analysis: By simulation, it gives often the same results. The D index seems useful for comaring the classification of two artitions in the case that the identification of classes is not nown in advance. D gives stress on airs in the same class of both artitions, as Jaccard s index. We found a relation between them. Since we found a strong correlation between the JV index and the D in our simulation, we can deduce that JV (or RV and D lead us to the same result in comaring artitions. The distributions of these roosed indices have been found very different from the case of indeendence and are bimodal. The bimodality might be exlained by the resence of local otima in the -means algorithm. 9. References [COH 60] COHEN, J., A Coefficient of Agreement for Nominal Scales., Educ. Psychol. Meas., vol 0, 7-46, 960. [FOW 83] FOWLKES, E.B. AND MALLOWS, C.L., A Method for Comaring Two Hiearchical Clusterings, Journal of American Statistical Association, vol. 78, 553-569, 983. [GRE 99] GREEN, P., KRIEGER, A. A Generalized Rand-Index Method for Consensus Clustering of Searate Partitions of the Same Data Base, Journal of Classification, vol.6, 63-89, 999. [HUB 85] HUBERT, L., ARABIE, P., Comaring Partitions, Journal of Classification,vol., 93-8, 985. [IDR 00] IDRISSI A. Contribution à l Unification de Critères d Association our Variables Qualitatives, Thèse de doctorat de l Université de Paris 6, 000. 9

[LAZ 0] LAZRAQ, A., CLEROUX R.. Inférence Robuste sur un Indice de Redondance, Revue de Statistique Aliquée, vol.4, 39-54, 00. [LER 73] LERMAN, I.C, Etude Distributionnelle de Statistiques de Proximité entre Structures Finies de Memes Tyes; Alication à la Classification Automatique, Cahier no.9 du Bureau Universitaire de Recherche Oérationelle, Institut de Statistique des Universités de Paris, 973. [LER 88] LERMAN, I.C, Comaring Partitions (Mathematical and Statistical Asects, Classification and Related Methods of Data Analysis, H.H Boc Editor, -3, 988. [JAN 8] JANSON S., VEGELIUS J. The J- Index as a Measure of Association For Nominal Scale Resonse Agreement. Alied sychological measurement, vol.6, 43-50, 98. [POP 83] POPPING, R. Traces of Agreement. On the Dot-Product as a Coefficient of Agreement. Quality and Quantity, vol 7 (, -8, 983. [POP 9] POPPING, R. Taxonomy on Nominal Scale Agreement, Groningen ; iec ProGAMMA, 945-990, 99,. [ROB 76] ROBERT P., ESCOUFIER, Y. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Al. Statist., vol. 5, 57-65, 976. [SAP 0] SAPORTA G., YOUNESS G. Concordance entre Deux Partitions: Quelques Proositions et Exériences, in Proceedings SFC 00, 8èmes rencontres de la Société Francohone de Classification, Pointe à Pitre, 00. [SAP 0] SAPORTA G., YOUNESS G. Comaring Two Partitions: some roosals and Exeriments, Proceedings in Comutational Statistics edited by Wolfgang Härdle, Physica- Verlag, Berlin, 00. [YOU 04] YOUNESS G., SAPORTA G., Une Méthodologie Pour la Comaraison de Partitions, Revue de Statistique Aliquée, à araître. 0