Statistical comparisons for the topological equivalence of proximity measures


Rafik Abdesselam (1) and Djamel Abdelkader Zighed (2)

(1) ERIC Laboratory, University of Lyon 2, Campus Porte des Alpes, 695 Bron, France (e-mail: rafik.abdesselam@univ-lyon2.fr)
(2) Human Sciences Institute, ISH - CNRS, 697 Lyon, France (e-mail: Abdelkader.Zighed@ish-lyon.cnrs.fr)

Abstract. In many application domains, the choice of a proximity measure directly affects the results of data mining methods for clustering, comparing or structuring a set of objects. Under a notion of equivalence such as the one based on pre-ordering, some proximity measures are more or less equivalent, meaning that they produce more or less the same results. In this paper, we introduce a new approach to comparing proximity measures. It is based on topological equivalence, which exploits the concept of local neighbors: two proximity measures are equivalent if they induce the same neighborhood structure on the objects. This information on equivalence can help in choosing one such measure. However, the $O(n^4)$ complexity of the pre-ordering approach makes it intractable when the sample size $n$ exceeds a few hundred. To cope with this limitation, we propose a topological approach with lower complexity, $O(n^2)$. We illustrate our approach and empirically compare the two tests on thirteen proximity measures for continuous data taken from the literature.

Keywords: Proximity measure, dissimilarity and adjacency matrices, Spearman and Kappa tests, neighborhood graph, pre-order and topological equivalences.

1 Introduction

Proximity measures are characterized by precise mathematical properties. Are they, for all that, all equivalent? Can they be used interchangeably in practice? In other words, does the choice of the proximity measure between individuals embedded in a multidimensional space such as $\mathbb{R}^p$ influence the results obtained?
The aim of this paper is to compare proximity measures in order to distinguish those that are identical from those that are not. So far, two proximity measures have been compared through the values of the proximity matrices they induce [], [2]. Alternatively, [5] focuses on the preorders induced by the two proximity measures and assesses their degree of similarity by the concordance between the induced preorders on the set of pairs of objects. We use a statistical test for comparing the matrices associated with proximity measures, based on the concept of neighborhood graphs. We consider that two proximity measures are topologically equivalent if they induce the same neighborhood structure on the objects. The comparison matrix is a useful tool for measuring the degree of resemblance between two empirical proximity matrices. Just as the nonparametric Spearman or Kendall tests on matched pairs of continuous data are used to compare two dissimilarity matrices and measure the degree of equivalence in preordonnance, we propose the nonparametric Kappa test to compare two binary adjacency matrices and measure the degree of equivalence in topology. To visualize these proximity measures, we propose, for example, applying an algorithm that constructs a hierarchy: proximity measures are grouped according to their degree of resemblance in preorder and in topology.

2 Proximity measures and equivalences

In this article we limit our work to proximity measures built on $\mathbb{R}^p$. Nevertheless, the approach could easily be extended to all kinds of data, quantitative or qualitative. Let us consider a sample of $n$ individuals $x, y, \ldots$ in a space of $p$ dimensions. Individuals are described by continuous variables: $x = (x_1, \ldots, x_p)$.

Table 1. Some proximity measures

Measure | Short | Formula
Euclidean | Euc | $u_E(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$
Mahalanobis | Mah | $u_{Mah}(x, y) = \sqrt{(x - y)^t \, \Sigma^{-1} (x - y)}$
Manhattan | Man | $u_{Man}(x, y) = \sum_{j=1}^{p} \lvert x_j - y_j \rvert$
Minkowski | Min | $u_{Min_\gamma}(x, y) = \left( \sum_{j=1}^{p} \lvert x_j - y_j \rvert^{\gamma} \right)^{1/\gamma}$
Tchebytchev | Tch | $u_{Tch}(x, y) = \max_{1 \le j \le p} \lvert x_j - y_j \rvert$
Cosine Dissimilarity | Cos | $u_{Cos}(x, y) = 1 - \frac{\langle x, y \rangle}{\lVert x \rVert \, \lVert y \rVert}$
Canberra | Can | $u_{Can}(x, y) = \sum_{j=1}^{p} \frac{\lvert x_j - y_j \rvert}{\lvert x_j \rvert + \lvert y_j \rvert}$
Squared Chord | SC | $u_{SC}(x, y) = \sum_{j=1}^{p} \left( \sqrt{x_j} - \sqrt{y_j} \right)^2$
Weighted Euclidean | WE | $u_{WE}(x, y) = \sqrt{\sum_{j=1}^{p} \alpha_j (x_j - y_j)^2}$
Chi-square | $\chi^2$ | $u_{\chi^2}(x, y) = \sum_{j=1}^{p} \frac{(x_j - m_j)^2}{m_j}$
Jeffrey Divergence | JD | $u_{JD}(x, y) = \sum_{j=1}^{p} \left( x_j \log \frac{x_j}{m_j} + y_j \log \frac{y_j}{m_j} \right)$
Pearson's Correlation | $\rho$ | $u_{\rho}(x, y) = 1 - \lvert \rho(x, y) \rvert$
Normalized Euclidean | NE | $u_{NE}(x, y) = \sqrt{\sum_{j=1}^{p} \left( \frac{x_j - y_j}{\sigma_j} \right)^2}$

Here $p$ is the dimension of the space, $x = (x_j)_{j=1,\ldots,p}$ and $y = (y_j)_{j=1,\ldots,p}$ are two points in $\mathbb{R}^p$, $(\alpha_j)_{j=1,\ldots,p}$ are the weights, $\Sigma^{-1}$ is the inverse of the variance-covariance matrix, $\sigma_j^2$ is the variance, $\gamma > 0$, $m_j = \frac{x_j + y_j}{2}$, and $\rho(x, y)$ denotes the Bravais-Pearson linear correlation coefficient.

In Table 1, we give a list of 13 conventional proximity measures. For our experiments and comparisons, we used the Iris dataset from the UCI repository.
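To make the formulas of Table 1 concrete, the following is a minimal Python sketch of a few of the listed measures and of the proximity matrix each one induces. The function names, the `proximity_matrix` helper and the sample points are illustrative assumptions, not part of the paper; skipping 0/0 terms in the Canberra sum is also an assumption made here to keep the sketch total.

```python
import math

def euclidean(x, y):
    # u_E: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # u_Man: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, gamma=3.0):
    # u_Min: generalizes Manhattan (gamma=1) and Euclidean (gamma=2)
    return sum(abs(a - b) ** gamma for a, b in zip(x, y)) ** (1.0 / gamma)

def tchebytchev(x, y):
    # u_Tch: largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_dissimilarity(x, y):
    # u_Cos: one minus the cosine of the angle between x and y
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def canberra(x, y):
    # u_Can: terms with a zero denominator are skipped (assumption of this sketch)
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)

def proximity_matrix(points, measure):
    # Hypothetical helper: n x n matrix of pairwise proximities
    n = len(points)
    return [[measure(points[i], points[j]) for j in range(n)] for i in range(n)]

# Illustrative 4-dimensional points, not taken from the paper's experiment
points = [(5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (6.3, 3.3, 6.0, 2.5)]
D_euc = proximity_matrix(points, euclidean)
D_man = proximity_matrix(points, manhattan)
```

Two such matrices generally contain different numerical values even when they order the pairs of individuals in the same way, which is the situation the equivalence notions below are meant to capture.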

2.1 Preorder equivalence

Two proximity measures $u_i$ and $u_j$ generally lead to different proximity matrices. Can we say that these two proximity measures are different just because the resulting matrices have different numerical values? To answer this question, many authors [7][8][8] have proposed approaches based on preordonnance, defined as follows.

Definition 1 (Equivalence in preordonnance). Let $u_i$ and $u_j$ be two proximity measures to be compared. If, for any quadruple $(x, y, z, t)$, we have

$$u_i(x, y) \le u_i(z, t) \Rightarrow u_j(x, y) \le u_j(z, t),$$

then the two measures are considered equivalent.

In order to compare proximity measures $u_i$ and $u_j$, we need to define an index that can be used as a similarity value between them. We denote this index by $S(u_i, u_j)$. For example, we can use the following similarity index, which is based on preordonnance:

$$S(u_i, u_j) = \frac{1}{n^4} \sum_{x} \sum_{y} \sum_{z} \sum_{t} \delta_{ij}(x, y, z, t)$$

where

$$\delta_{ij}(x, y, z, t) = \begin{cases} 1 & \text{if } [u_i(x, y) - u_i(z, t)] \times [u_j(x, y) - u_j(z, t)] > 0 \\ & \text{or } u_i(x, y) = u_i(z, t) \text{ and } u_j(x, y) = u_j(z, t) \\ 0 & \text{otherwise.} \end{cases}$$

$S$ varies in the range $[0, 1]$. Hence, for two proximity measures $u_i$ and $u_j$, a value of 1 means that the preorder induced by the two proximity measures is the same, and therefore that the two proximity matrices of $u_i$ and $u_j$ are equivalent.

3 Topological equivalence

This approach is based on the concept of a topological graph, which uses a neighborhood graph. The basic idea is quite simple: we associate a neighborhood graph with each proximity measure (this is our topological graph), and we say that two proximity measures are equivalent if the topological graphs they induce are the same. To evaluate the similarity between proximity measures, we compare neighborhood graphs and quantify to what extent they are equivalent.

3.1 Topological graphs

For a proximity measure $u$, we can build a neighborhood graph on a set of individuals, where the vertices are the individuals and the edges are defined by a neighborhood relationship property. We thus simply have to define the binary neighborhood relationship between all pairs of individuals. There are many possibilities for defining this relationship. For instance, we can use the definition of the Relative Neighborhood Graph (RNG) [9], where two individuals are related if they satisfy the following property:

$$V_u(x, y) = \begin{cases} 1 & \text{if } u(x, y) \le \max(u(x, z), u(y, z)) \quad \forall z \ne x, y \\ 0 & \text{otherwise.} \end{cases}$$
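The RNG property above translates directly into code. The sketch below is a straightforward, unoptimized construction of the adjacency matrix $V_u$ from a symmetric proximity matrix; the function name `rng_adjacency` and the convention of setting the diagonal to 0 are assumptions of this sketch, not specified in the text.

```python
def rng_adjacency(D):
    """Adjacency matrix of the Relative Neighborhood Graph.

    D is a symmetric n x n proximity matrix (list of lists).
    V[x][y] = 1 when u(x, y) <= max(u(x, z), u(y, z)) for every z other than x and y.
    """
    n = len(D)
    V = [[0] * n for _ in range(n)]  # diagonal left at 0 (assumption)
    for x in range(n):
        for y in range(x + 1, n):
            # Check the RNG property against every third individual z
            if all(D[x][y] <= max(D[x][z], D[y][z])
                   for z in range(n) if z != x and z != y):
                V[x][y] = V[y][x] = 1
    return V

# Three points on a line at positions 0, 1 and 3, with u(x, y) = |x - y|
positions = [0, 1, 3]
D = [[abs(a - b) for b in positions] for a in positions]
V = rng_adjacency(D)
# The two extreme points are not neighbors: the middle point is closer to both
```

This direct check visits every triple of individuals; it is meant to illustrate the definition rather than to be an efficient construction.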

Geometrically, this property means that the hyper-lunula (the intersection of the two hyper-spheres centered on the two points) is empty. The set of pairs that satisfy this property forms a connected graph such as the one shown in Figure 1. For the example shown, the proximity measure used is the Euclidean distance. The topological graph is fully defined by its adjacency matrix, as in Figure 1.

Fig. 1. RNG example for a set of points in $\mathbb{R}^2$ and the associated adjacency matrix $V_u$.

In order to use the topological approach, the relationship property must lead to a connected graph. Of the various possibilities for defining the binary relationship, we can use the properties of a Gabriel Graph or any other algorithm that leads to a connected graph, such as the Minimal Spanning Tree (MST). For our work, we use only the Relative Neighborhood Graph (RNG), because of the relationship that exists between these graphs [9].

3.2 Similarity between proximity measures in the topological framework

From the previous material, using topological graphs (represented by an adjacency matrix), we can evaluate the similarity between two proximity measures via the similarity between the topological graphs each one produces. To do so, we just need the adjacency matrix associated with each graph. Let $V_{u_i}$ and $V_{u_j}$ denote the two adjacency matrices associated with the two proximity measures. To measure the degree of similarity between the two proximity measures, we compare the two adjacency matrices entry by entry and count the entries on which they agree. The value is computed as:

$$S(V_{u_i}, V_{u_j}) = \frac{1}{n^2} \sum_{x \in \Omega} \sum_{y \in \Omega} \delta_{ij}(x, y) \quad \text{where} \quad \delta_{ij}(x, y) = \begin{cases} 1 & \text{if } V_{u_i}(x, y) = V_{u_j}(x, y) \\ 0 & \text{otherwise.} \end{cases}$$

$S$ is the measure of similarity, which varies in the range $[0, 1]$. A value of 1 means that the two adjacency matrices are identical and therefore that the topological structure induced by the two proximity measures is the same, meaning that the proximity measures considered are equivalent. A value of 0 means that there is full discordance between the two matrices ($V_{u_i}(x, y) \ne V_{u_j}(x, y)$ $\forall \omega = (x, y) \in \Omega^2$). $S$ is thus the extent of agreement between the adjacency matrices.
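As an illustration of the index $S(V_{u_i}, V_{u_j})$, the following minimal Python sketch counts the entries on which two adjacency matrices agree and divides by $n^2$. The function name `topological_similarity` and the two small example matrices are hypothetical, chosen only to show the computation.

```python
def topological_similarity(V_i, V_j):
    """Proportion of entries on which two n x n adjacency matrices agree."""
    n = len(V_i)
    agree = sum(1 for x in range(n) for y in range(n)
                if V_i[x][y] == V_j[x][y])
    return agree / (n * n)

# Two hypothetical neighborhood graphs on the same three individuals
V_a = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
V_b = [[0, 1, 1],
       [1, 0, 1],
       [1, 1, 0]]
s_ab = topological_similarity(V_a, V_b)  # the graphs differ on one edge (two symmetric entries)
s_aa = topological_similarity(V_a, V_a)  # identical graphs
```

The computation is a single pass over the $n^2$ entries of the two matrices, which is where the lower complexity of the topological comparison comes from once the adjacency matrices are available.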

4 Comparison between topological and preordonnance equivalences

In this work, our goal is not to compare proximity matrices or the preorders they induce, but to propose a different approach, topological equivalence, which we compare with preorder equivalence. We propose to test whether the matrices are statistically different or not, using Kendall's tau in the preorder approach and Cohen's kappa coefficient in the topological approach. These nonparametric tests compare proximity measures through their associated proximity matrices for preorder equivalence, and through their associated adjacency matrices for topological equivalence.

4.1 Statistical comparisons between two proximity measures

For preorder equivalence, we propose to use a concordance index between the induced preorders as a proximity measure between two measures $u_i$ and $u_j$. To carry out this approach we can, like [], use the generalized Kendall rate based on the concordance of ranks. The ranks of the $n(n-1)$ pairs of proximity values between $x$ and $y$ under $u_i$ are compared with those under $u_j$. We denote by $R_i(x, y)$ and $R_j(x, y)$ the respective ranks of $u_i(x, y)$ and $u_j(x, y)$:

$$D(u_i, u_j) = \frac{2}{n(n-1)} \sum_{x} \sum_{y \ne x} \delta_{ij}(x, y) \quad \text{where} \quad \delta_{ij}(x, y) = \begin{cases} 1 & \text{if } R_i(x, y) = R_j(x, y) \\ 0 & \text{otherwise.} \end{cases}$$

The comparison between indices of proximity measures has also been studied by [12], [13] from a statistical perspective. The authors proposed an approach that compares the similarity matrices obtained with each proximity measure, in a pairwise manner, using Mantel's test [8].

Let $V_{u_i}$ and $V_{u_j}$ be the adjacency matrices associated with two proximity measures $u_i$ and $u_j$. To compare the degree of topological equivalence between two proximity measures, we propose to test whether the associated adjacency matrices are statistically different or not, using a nonparametric test for paired data. These binary, symmetric matrices of order $n$ are unfolded into two vectors of matched components, consisting of the $n(n-1)/2$ values above (or below) the diagonal. The degree of topological equivalence between two proximity measures is estimated by the Kappa coefficient of concordance [3], computed on the $2 \times 2$ contingency table formed by the two vectors:

$$\kappa = \kappa(V_{u_i}, V_{u_j}) = \frac{\Pi_o - \Pi_e}{1 - \Pi_e}$$

with $\Pi_o$ the observed proportion of concordance and $\Pi_e$ the proportion of concordance expected at random.

We then formulate the null hypothesis $H_0: \kappa = 0$ (independence of agreement or concordance). The concordance is all the higher as $\kappa$ tends to 1, and it is perfect (maximal) if $\kappa = 1$. It is equal to $-1$ in the case of perfect discordance. The results of these pairwise comparisons of proximity measures are given in Table 2.
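The Kappa computation described above can be sketched as follows: each adjacency matrix is unfolded into the vector of its $n(n-1)/2$ entries above the diagonal, and $\kappa$ is obtained from the observed and chance-expected proportions of agreement. The helper names `upper_triangle` and `cohen_kappa` are assumptions of this sketch, and only the coefficient is computed; the significance test against $H_0: \kappa = 0$ is not included.

```python
def upper_triangle(V):
    # Unfold a symmetric n x n matrix into its n(n-1)/2 entries above the diagonal
    n = len(V)
    return [V[x][y] for x in range(n) for y in range(x + 1, n)]

def cohen_kappa(a, b):
    """Cohen's kappa for two matched binary (0/1) vectors of equal length."""
    m = len(a)
    # Observed proportion of agreement
    p_o = sum(1 for s, t in zip(a, b) if s == t) / m
    # Marginal proportions of ones in each vector
    pa1 = sum(a) / m
    pb1 = sum(b) / m
    # Agreement expected at random from the marginals of the 2 x 2 table
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if p_e == 1.0:
        # Both vectors are constant and identical: kappa is undefined
        return float("nan")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical unfolded adjacency vectors
a = [1, 1, 0, 0]
b = [1, 0, 0, 0]
kappa_ab = cohen_kappa(a, b)             # partial agreement
kappa_aa = cohen_kappa(a, a)             # perfect concordance
kappa_opp = cohen_kappa(a, [0, 0, 1, 1]) # perfect discordance
```

In practice the two input vectors would be `upper_triangle(V_i)` and `upper_triangle(V_j)` for the adjacency matrices of the two measures being compared.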

Table 2. $\rho_s$: Preorder (row) & $\kappa$: Topology (column)

Columns: $u_E$, $u_{Mah}$, $u_{Man}$, $u_{Min}$, $u_{Tch}$, $u_{Cos}$, $u_{Can}$, $u_{SC}$, $u_{WE}$, $u_{\chi^2}$, $u_{JD}$, $u_{\rho}$, $u_{NE}$
$u_E$: 456 845 835 767 34 588 753 753 753 4 72
$u_{Mah}$: 633 336 456 4 378 3 3 456 222 3 357 357
$u_{Man}$: 996 635 767 75 3 689 767 845 767 767 28 8
$u_{Min}$: 998 627 992 845 258 56 67 835 67 67 4 72
$u_{Tch}$: 99 59 98 994 378 6 767 767 767 767 432 735
$u_{Cos}$: 876 427 858 88 893 76 34 34 34 34 88 24
$u_{Can}$: 92 43 93 97 926 92 753 588 753 753 8 64
$u_{SC}$: 967 524 957 972 979 927 975 753 32 64
$u_{WE}$: 633 996 998 99 876 92 967 753 753 4 72
$u_{\chi^2}$: 97 53 96 974 98 928 973 97 32 64
$u_{JD}$: 963 527 959 973 98 928 974 969 32 64
$u_{\rho}$: 977 674 979 972 953 86 899 942 977 946 944 3
$u_{NE}$: 874 434 856 88 892 985 893 94 874 96 95 856
Not significant with risk of error 5%

The Kappa and Kendall's tau values between the 13 proximity measures in the topological and preordonnance frameworks are given in Table 2. The part of Table 2 above the diagonal presents the exact values of the Kappa statistic. We can conclude, with a 5% risk of error, that only the pairs of measures $(u_{Cos}, u_{Can})$ and $(u_{Can}, u_{\rho})$ are not significantly equivalent in topology. Note that the pairs $(u_E, u_{WE})$, $(u_{SC}, u_{\chi^2})$, $(u_{SC}, u_{JD})$ and $(u_{\chi^2}, u_{JD})$ are in perfect topological equivalence ($\kappa = 1$).

4.2 Classification of proximity measures

Let $D_{u_i}(E \times E)$ and $D_{u_j}(E \times E)$ be the distance tables associated with proximity measures $u_i$ and $u_j$. Each of these distances generates a topological structure on the objects of $E$. This structure is fully described by its adjacency matrix. To measure the degree of resemblance between graphs, we simply compare, entry by entry, the two adjacency matrices $V_{u_i}$ and $V_{u_j}$ associated with the two topological structures:

$$S(V_{u_i}, V_{u_j}) = \frac{1}{n^2} \sum_{x} \sum_{y} \delta_{ij}(x, y) \quad \text{with} \quad \delta_{ij}(x, y) = \begin{cases} 1 & \text{if } V_{u_i}(x, y) = V_{u_j}(x, y) \\ 0 & \text{otherwise.} \end{cases}$$

A value of 1 for the similarity measure $S$ means that both adjacency matrices are identical and therefore that the topological structure induced by the two measures is the same. In this case, we speak of topological equivalence between the two proximity measures. A value of 0 means that the topology is completely changed. From this measure $S$, we can compare and classify proximity measures according to their degree of resemblance. The topological and preorder similarity values between the 13 proximity measures are given in Table 3. The similarity values are quite close to 1, and the pairs of proximity measures $(u_E, u_{WE})$, $(u_{SC}, u_{JD})$ and $(u_{\chi^2}, u_{JD})$ are in perfect topological and preordonnance equivalence. We can visualize these proximity measures by applying, for example, an Ascendant Hierarchical Classification according to Ward's criterion [14] (Figure 2).
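For comparison with the topological index, the preorder similarity $S(u_i, u_j)$ defined in Section 2.1 can be sketched by brute force over all quadruples $(x, y, z, t)$. The function name `preorder_similarity` and the small integer matrices are illustrative assumptions; exact equality is used to detect ties, which is only safe here because the example values are integers.

```python
def preorder_similarity(D_i, D_j):
    """Preorder similarity between two n x n proximity matrices.

    A quadruple (x, y, z, t) counts as concordant when both measures order
    the pairs (x, y) and (z, t) the same way, or when both give a tie.
    """
    n = len(D_i)
    pairs = [(x, y) for x in range(n) for y in range(n)]
    agree = 0
    for x, y in pairs:
        for z, t in pairs:
            a = D_i[x][y] - D_i[z][t]
            b = D_j[x][y] - D_j[z][t]
            if a * b > 0 or (a == 0 and b == 0):
                agree += 1
    return agree / n ** 4

# Hypothetical matrices: D_sq is a monotone transform of D_1, D_2 is not
D_1 = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
D_sq = [[v * v for v in row] for row in D_1]
D_2 = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
s_same = preorder_similarity(D_1, D_sq)  # same preorder
s_diff = preorder_similarity(D_1, D_2)   # some pairs are ordered differently
```

The two nested loops over all $n^2$ pairs make the $n^4$ cost of the preorder comparison explicit, in contrast with the single pass over $n^2$ entries needed to compare two adjacency matrices.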

Table 3. $S(V_{u_i}, V_{u_j})$: Topology (row) & $S(u_i, u_j)$: Preorder (column)

Columns: $u_E$, $u_{Mah}$, $u_{Man}$, $u_{Min}$, $u_{Tch}$, $u_{Cos}$, $u_{Can}$, $u_{SC}$, $u_{WE}$, $u_{\chi^2}$, $u_{JD}$, $u_{\rho}$, $u_{NE}$
$u_E$: 776 973 988 967 869 89 942 947 945 926 863
$u_{Mah}$: 876 773 774 752 7 77 737 776 739 738 742 73
$u_{Man}$: 964 84 964 94 855 882 93 973 933 932 924 848
$u_{Min_\gamma}$: 964 876 947 967 87 892 946 988 95 949 925 866
$u_{Tch}$: 947 858 929 964 865 887 94 957 942 942 94 86
$u_{Cos}$: 858 858 84 84 858 893 898 869 899 899 83 957
$u_{Can}$: 9 84 929 893 9 822 943 89 94 942 874 868
$u_{SC}$: 947 84 947 929 947 858 947 942 957 93 884
$u_{Ew}$: 876 964 964 947 858 9 947 947 945 926 863
$u_{\chi^2}$: 947 84 947 929 947 858 947 947 92 885
$u_{JD}$: 947 84 947 929 947 858 947 947 94 884
$u_{HIM}$: 884 83 884 867 92 884 884 92 884 92 92 825
$u_{\rho}$: 867 849 83 867 867 973 796 849 867 849 849 876
Not significant with risk of error 5%

Fig. 2. a) Topological and b) Preorder hierarchical trees.

Table 4. Contingency tables: preordonnance (PC) versus topology (TC)

TC1 TC2 TC3 TC4 Total
PC1: 6 6
PC2: 4 4
PC3: 2 2
PC4:
Total: 6 4 2 3

TC1 TC2 TC3 Total
PC1:
PC2: 2 2
PC3:
Total: 2 3

TC1 TC2 Total
PC1: 2 2
PC2:
Total: 3 3

κ = 4348 ; p-value = 574%
κ = ; p-value < %
κ = ; p-value < %

Fig. 3. Comparison of hierarchical trees (vertical axis: ratio of inertia, between/total; horizontal axis: number of classes; series: preordonnance and topology).

5 Conclusion and perspectives

This work proposes a new approach to topological equivalence between proximity measures, based on the notion of neighborhood graphs. Applying the nonparametric Kappa test to the adjacency matrices associated with the proximity measures gives the topological equivalence a statistical significance and validates or rejects it, that is, it indicates whether or not the measures induce the same neighborhood structure on the objects. The results of the comparisons between topological and preordonnance equivalences are almost identical, both for the statistical tests and for the classification results. This topological approach extends to other data types (binary, qualitative). We plan to study the influence of the data and of the neighborhood structure on the results, as well as the effect of the choice of classification method on the groups of proximity measures.

References

1. Batagelj, V., Bren, M.: Comparing resemblance measures. Journal of Classification 12 (1995) 73-90
2. Bouchon-Meunier, B., Rifqi, M., Bothorel, S.: Towards general measures of comparison of objects. Fuzzy Sets and Systems 84(2) (1996) 143-153
3. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1960) 37-46
4. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics (2003)
5. Lerman, I. C.: Indice de similarité et préordonnance associée, Ordres. In: Travaux du séminaire sur les ordres totaux finis, Aix-en-Provence (1967)
6. Lesot, M. J., Rifqi, M., Benhadda, H.: Similarity measures for binary and numerical data: a survey. IJKESDP 1 (2009) 63-84
7. Liu, H., Song, D., Rüger, S., Hu, R., Uren, V.: Comparing dissimilarity measures for content-based image retrieval. In: Information Retrieval Technology, Springer, 44-50
8. Mantel, N.: A technique of disease clustering and a generalized regression approach. Cancer Research 27 (1967) 209-220
9. Preparata, F. P., Shamos, M. I.: Computational Geometry: An Introduction. Springer (1985)
10. Richter, M. M.: Classification and learning of similarity measures. In: Proceedings der Jahrestagung der Gesellschaft für Klassifikation, Studies in Classification, Data Analysis and Knowledge Organisation, Springer Verlag (1992)
11. Rifqi, M., Detyniecki, M., Bouchon-Meunier, B.: Discrimination power of measures of resemblance. IFSA'03, Citeseer (2003)
12. Schneider, J. W., Borlund, P.: Matrix comparison, Part 1: Motivation and important issues for measuring the resemblance between proximity measures or ordination results. Journal of the American Society for Information Science and Technology 58 (2007) 1586-1595
13. Schneider, J. W., Borlund, P.: Matrix comparison, Part 2: Measuring the resemblance between proximity measures or ordination results by use of the Mantel and Procrustes statistics. Journal of the American Society for Information Science and Technology 58 (2007) 1596-1609
14. Ward, J. H. Jr.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301) (1963) 236-244