
arXiv: v3 [stat.ML] 17 Oct 2017

A New Family of Near-metrics for Universal Similarity

Chu Wang    Iraj Saniee    William S. Kenney    Chris A. White

October 18, 2017

Abstract

We propose a family of near-metrics based on local graph diffusion to capture similarity for a wide class of data sets. These quasi-metametrics, as their name suggests, dispense with one or two standard axioms of metric spaces, specifically distinguishability and symmetry, so that similarity between data points of arbitrary type and form could be measured broadly and effectively. The proposed near-metric family includes the forward k-step diffusion and its reverse, typically on the graph consisting of data objects and their features. By construction, this family of near-metrics is particularly appropriate for categorical data, continuous data, and vector representations of images and text extracted via deep learning approaches. We conduct extensive experiments to evaluate the performance of this family of similarity measures and compare and contrast with traditional measures of similarity used for each specific application and with the ground truth when available. We show that for structured data including categorical and continuous data, the near-metrics corresponding to normalized forward k-step diffusion (k small) work as one of the best performing similarity measures; for vector representations of text and images including those extracted from deep learning, the near-metrics derived from normalized and reverse k-step graph diffusion (k very small) exhibit outstanding ability to distinguish data points from different classes.

Nokia Bell Labs, 600 Mountain Avenue, Murray Hill, NJ (chu.wang, iraj.saniee, william.kenney, chris.white@nokia-bell-labs.com).

1 Introduction

A core requirement and a fundamental module in various machine learning tasks is to measure the similarity between data points. Clustering, community detection, search and query, and recommendation algorithms, to name only a few, involve similarity derivations that are essentially constructed on top of underlying distance measures. To this end, numerous similarities have been designed specifically for particular types of data or data sets in order to optimize performance [4, 11, 33].

For categorical data sets, the most straightforward similarity measure is the overlap measure, which counts the number of matching attributes between two data points. More delicate measures take into consideration the number of categories, feature frequencies, and other local or global information. Frequently used measures for categorical data include but are not limited to Eskin, Lin, Goodall, and Occurrence Frequency. We refer interested readers to [4] for a comprehensive review.

For continuous data sets, completely different similarity measures have been used. Cosine similarity and the inner product similarity are textbook approaches [13]. The Minkowski distance is widely used, which includes both the Manhattan distance (order 1 version) and the Euclidean distance (order 2 version). If data points are generated from the same distribution, the Mahalanobis distance, or more broadly the Bregman divergence, is frequently adopted [7, 5]. For distances between data sets or distributions, the Bhattacharyya and Hellinger distances and the Kullback-Leibler divergence are often leveraged [3, 17, 15]. The Hamming distance continues to be used for bit-by-bit string comparison [12]. At the risk of information loss, categorical similarities also apply if the data is discretized first. The MDL method is the most well-known algorithm for such a discretization task [9].

Besides the structured data types mentioned above, unstructured data, such as text, audio, image, and video, are even more commonly encountered [10]. It is not straightforward to regard unstructured data as in a simple vector form, because of its sequential or geometrical properties [28, 21, 26, 27, 11]. One needs to carefully transform such unstructured data into a vector representation before applying similarity measures for reasonable success. For text data, the traditional tf-idf method captures the frequency information while losing the order information [30]. More recent deep learning approaches like the word2vec model map any given word into a dense, low-dimensional (usually no more than a few hundred tuple) vector [20]. Deep-learning based vector representation is already making its way into various applications [16, 6, 14, 34], yet to the best of our knowledge, there is currently no related work on similarity measures designed for such vectors.
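As a concrete reference point for the comparisons later in the paper, here is a minimal sketch (our own illustration, not from the paper) of a few of the classical measures reviewed above, assuming plain NumPy vectors:

```python
import numpy as np

def overlap(x, y):
    # Categorical overlap: count of matching attributes.
    return int(np.sum(np.asarray(x) == np.asarray(y)))

def minkowski(x, y, order=2):
    # order=1 gives the Manhattan distance, order=2 the Euclidean distance.
    d = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return float(np.sum(d ** order) ** (1.0 / order))

def cosine(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```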

In this paper, we propose a family of near-metrics based on local graph diffusion to quantify a distance or likeness between data points. Instead of focusing on a specific data set or data type, our motivation is to develop a family of distance-like functions, or near-metrics, that are sufficiently general to capture all the above data types in a single representation. For n data points, each of them with m features, we regard the data set as a bipartite graph consisting of n object nodes and m feature nodes connected by mn weighted edges. The edge weights measure the presence, extent or strength, of each feature in each data object. The similarity of one data point to another can then be defined by the transition probability of the random walk on this bipartite graph from one point to the other for a given number of steps. Such a graph diffusion-based similarity has several natural variants according to the locality parameter k, which is the number of steps of the random walk (its order), the diffusion direction (the forward or reverse variant), and the normalization of the feature weights (the normalized variant), where the edge weights are rescaled at each incident node to add to 1.

1.1 Key Contributions

The near-metrics proposed are based on the abstract concept of diffusion on the object-feature graph. Therefore, they are not limited to specific data types or data sets and we expect them to work in a wide variety of areas. We follow a dual path: first we analyze the graph diffusion similarity itself theoretically in order to uncover its properties and features; second, we evaluate the performance of this family of distances on various data sets and data types in comparison with frequently used similarities and, when possible, with the ground truth. We summarize our key contributions as follows:

We propose a family of near-metrics. The intuition behind the localized (i.e., finite-step) graph diffusion similarity is the mass transfer analogy of a random walk from one object to another via their common features: the more common features two objects have, the more mass is transferred from one object to the other.

The family of near-metrics is analyzed theoretically in several respects. We show that, when applied to categorical datasets or distributions, the graph diffusion is a metametric.

For general data sets, conditions for the proposed near-metrics to be metametrics or quasi-metametrics are thoroughly discussed. The analysis demonstrates that the proposed near-metrics are able to function as well-defined similarities, while giving enough flexibility to accommodate different types of data.

We evaluate this family of near-metrics on a wide variety of benchmark datasets of all types including categorical, continuous, text, images, and their mixtures. For structured data sets, the proposed near-metrics are among the best performing measures of similarity. For compact representations of unstructured data like texts and images, the proposed near-metrics capture the intrinsic relation between data points and their performance is outstanding.

1.2 Organization of the Paper

The rest of the paper is organized as follows. In Section 2, we formally introduce the graph diffusion similarity and its variants, discuss their basic properties, and provide necessary and sufficient conditions for these to be quasi-metrics (loss of symmetry) or metametrics (loss of complete distinguishability). In Section 4, we evaluate the performance of the graph diffusion similarity together with frequently used similarity measures on various benchmark structured data sets. In Section 5, the evaluation is further conducted for vector representations of unstructured data including texts and images. Section 6 discusses our results and observations, proposes future research directions, and concludes the paper.

2 Graph Diffusion Similarity

Consider n objects, each endowed with a non-negative feature vector of dimension m. All the information about objects and their features is captured by a bipartite graph B of n object nodes and m feature nodes together with the edges between them. The weight of the edge linking object i and feature j is defined as the strength or value of feature j of object i. Continuous data sets, binary data sets, and vector representations of unstructured data can be mapped to this bipartite graph form directly; for categorical data, the transformation is also straightforward: a categorical feature with l different categories is replaced by an l-bit one-hot binary feature vector. Notice that this transformation does not incur any information loss. A minimal sketch of this construction is given below.
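In the sketch (the helper build_feature_matrix and the toy records are our own illustration, not the authors' code), numeric attributes become edge weights directly and each categorical attribute becomes an l-bit one-hot block:

```python
import numpy as np

def build_feature_matrix(records, categorical_cols, numeric_cols):
    """Map mixed records to the non-negative feature matrix W of the
    object-feature bipartite graph B: numeric columns become edge
    weights directly, categorical columns are one-hot encoded."""
    # Collect the category vocabulary of each categorical column.
    vocab = {c: sorted({r[c] for r in records}) for c in categorical_cols}
    rows = []
    for r in records:
        row = [float(r[c]) for c in numeric_cols]            # numeric edge weights
        for c in categorical_cols:                           # l-bit one-hot block
            row += [1.0 if r[c] == v else 0.0 for v in vocab[c]]
        rows.append(row)
    return np.array(rows)                                    # shape (n, m)

# Example with two categorical and one numeric attribute.
records = [{"color": "red", "shape": "round", "size": 2.0},
           {"color": "blue", "shape": "round", "size": 1.5}]
W = build_feature_matrix(records, ["color", "shape"], ["size"])
```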

An example with n = 4 and m = 5 is shown in Figure 1(a), where the blue nodes are objects and the orange nodes are features. To illustrate the idea behind the graph diffusion similarity, we consider the random walk on B as follows. We initially place a particle at any object node the similarity to which we wish to quantify, and let a random walk on the graph take place. For example, in Figure 1 the particle is originally placed at the first node. Since there are 3 edges pointing away with weights 3, 5, and 2, the probability vector that the particle ends up at feature nodes 1 to 5 is (0.3, 0.5, 0, 0.2, 0), respectively; see Figure 1(b). We call this a 1-step diffusion or random walk on B starting at node 1. Now the particle conducts another step of the random walk, and the corresponding probability (to arrive at any given object node) can be computed by adding the probabilities from all feature nodes; see Figures 1(c) and 1(d). We call these two steps of the random walk one round. The induced subgraph after one round (2 steps) gives rise to an object-object graph that we denote by G. In the language of Markov chain theory, the 2m-step random walk on B is the first m rounds of iterations in the computation of the stationary distribution (principal eigenvector) of the row-normalized adjacency matrix of G starting at the localization vector u = (0, ..., 1, ..., 0), where the 1 is placed at node 1 in the above example; see [22] for details of the iterative or power method. Random walk theory is frequently used for the task of community detection [29, 23], yet to the best of our knowledge, there is no previous work adopting random walks for the construction of similarity measures.

Figure 1: Example: the calculation of graph diffusion similarity.

Now let the particle start at object i in G; we then define the order k diffusion similarity of i to j, denoted by $g^{(k)}(i, j)$, as the probability that the particle starting at i ends up at j after k steps in G, or 2k steps in B. Notice that similar objects, that is, those with similar features and strengths, have stronger connections in the bipartite graph B, and consequently the k-round transition probability between them in G will be higher. A two-step sketch of this walk on the Figure 1 example is given below.
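The two steps of one round can be written out directly. The sketch below assumes a hypothetical completion of the Figure 1 edge weights; only object 1's weights (3, 5, and 2 on features 1, 2, and 4) are given in the text, and the other rows are made up for illustration.

```python
import numpy as np

# Hypothetical 4x5 object-feature weight matrix; only the first row
# is taken from the Figure 1 description.
W = np.array([[3.0, 5.0, 0.0, 2.0, 0.0],
              [1.0, 0.0, 4.0, 0.0, 2.0],
              [0.0, 2.0, 1.0, 3.0, 0.0],
              [2.0, 0.0, 0.0, 1.0, 4.0]])

p = W / W.sum(axis=1, keepdims=True)     # object -> feature step (row-normalized)
q = W / W.sum(axis=0, keepdims=True)     # feature -> object step (column-normalized)

# 1-step diffusion from object 1 (index 0): the particle lands on a feature.
feat_probs = p[0]                        # -> [0.3, 0.5, 0.0, 0.2, 0.0]

# Second step: back to the objects, summing contributions over all features.
obj_probs = feat_probs @ q.T             # one full round on B, one step on G
print(feat_probs, obj_probs, obj_probs.sum())   # obj_probs sums to 1
```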

This family of similarity measures may thus be seen as a truncated and localized version of the principal eigenvector computation on G. Unlike this computation, we are interested in the finite-step transition probability between a pair of nodes (i, j) on G as a measure of similarity between i and j, which the principal eigenvector does not provide.

It is clear that $g^{(k)}(i, j)$ is not necessarily symmetric; therefore we define $r^{(k)}(i, j) := g^{(k)}(j, i)$, which we will refer to as the reverse diffusion similarity, and it quantifies the k-step similarity of j to i. To balance the importance of each feature, one can normalize each feature vector's row-sum to 1 and then calculate the graph diffusion similarity. We call the corresponding similarity, denoted by $n^{(k)}(i, j)$, the normalized graph diffusion similarity. We will show later that the normalized graph diffusion similarity is symmetric: $n^{(k)}(i, j) = n^{(k)}(j, i)$. All of the above are measures of similarity, each with a corresponding measure of distance: $d_g^{(k)}(i, j) := 1 - g^{(k)}(i, j)$, $d_r^{(k)}(i, j) := 1 - r^{(k)}(i, j)$, and $d_n^{(k)}(i, j) := 1 - n^{(k)}(i, j)$, which are the graph diffusion distance, reverse graph diffusion distance, and normalized graph diffusion distance, respectively.

To simplify the analysis, we introduce the matrix form of the above graph diffusion similarity and distance measures. For the $n \times m$ feature matrix $W = (w_{ij})$, where $1 \le i \le n$ and $1 \le j \le m$, define diagonal matrices $P = (p_{ij})$ and $Q = (q_{ij})$ as

$$p_{ll} = \sum_{s=1}^{m} w_{ls}, \qquad q_{ll} = \sum_{s=1}^{n} w_{sl}.$$

In other words, P and Q are the row-sum and column-sum diagonal matrices corresponding to W. We assume that $p_{ll}$ and $q_{ll}$ are non-zero; otherwise we can discard the null object or remove the absent feature. When its dimension is understood, let $\mathbf{1}$ be the all-one column vector. Define the $n \times n$ matrix S as:

$$S := P^{-1} W Q^{-1} W^{\top}. \qquad (1)$$

Based on the definition of P and Q, it is clear that S is a row-stochastic matrix since $S\mathbf{1} = P^{-1} W Q^{-1} W^{\top} \mathbf{1} = P^{-1} W \mathbf{1} = \mathbf{1}$. Recall that for the random walks on B or G, the matrix S is actually the single-step transition matrix on G, or the two-step transition matrix on B. Let $G^{(k)} = (g^{(k)}(i, j))$ be the $n \times n$ matrix of the pairwise graph diffusion similarity; then it is clear that $G^{(1)} = S$. The higher order diffusion similarity is straightforward to calculate:

$$G^{(k)} = S^k = \left( P^{-1} W Q^{-1} W^{\top} \right)^k. \qquad (2)$$
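A direct implementation of (1) and (2) is short. The sketch below (ours, not the authors' code) computes the forward, reverse, and normalized variants from a non-negative feature matrix W:

```python
import numpy as np

def diffusion_similarity(W, k=1):
    """Order-k graph diffusion similarity G^(k) = (P^-1 W Q^-1 W^T)^k for a
    non-negative n x m feature matrix W with no zero rows or columns."""
    P_inv = 1.0 / W.sum(axis=1)          # inverse row sums (diagonal of P^-1)
    Q_inv = 1.0 / W.sum(axis=0)          # inverse column sums (diagonal of Q^-1)
    S = (P_inv[:, None] * W) @ (Q_inv[:, None] * W.T)   # row-stochastic, eq. (1)
    return np.linalg.matrix_power(S, k)                  # eq. (2)

def reverse_similarity(W, k=1):
    # r^(k)(i, j) := g^(k)(j, i)
    return diffusion_similarity(W, k).T

def normalized_similarity(W, k=1):
    # Normalize each object's feature row to sum to 1 before diffusing.
    return diffusion_similarity(W / W.sum(axis=1, keepdims=True), k)

W = np.array([[3.0, 5.0, 0.0, 2.0, 0.0],
              [1.0, 0.0, 4.0, 0.0, 2.0],
              [0.0, 2.0, 1.0, 3.0, 0.0],
              [2.0, 0.0, 0.0, 1.0, 4.0]])
G2 = diffusion_similarity(W, k=2)        # rows sum to 1
D2 = 1.0 - G2                            # graph diffusion distance d_g^(2)
```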

In particular, for $g^{(1)}(i, j)$, an explicit formula can be written as:

$$g^{(1)}(i, j) = \sum_{s=1}^{m} \frac{w_{is}}{w_{i1} + \cdots + w_{im}} \cdot \frac{w_{js}}{w_{1s} + \cdots + w_{ns}} = \frac{1}{p_{ii}} \sum_{s=1}^{m} \frac{w_{is} w_{js}}{q_{ss}}. \qquad (3)$$

A practical issue worth mentioning is the computational cost of the graph diffusion distance. If the goal is to compute a single pair similarity, (3) shows that the cost is O(mn), which is less than ideal. For example, the Euclidean distance or cosine similarity only requires O(m) calculations. However, note that the graph diffusion similarity for one pair of objects is not as important as the similarity between a set of objects and a fixed object. To wit, for similarity search and related tasks relative to an object i, one may need to calculate $g^{(1)}(i, j)$ for all j followed by ranking. From the matrix form (1), it is clear that the computational cost for this task is still O(mn), which scales linearly in the number of objects and the number of features. Again from (1), the computation of the graph diffusion similarity for all pairs of objects requires $O(mn^2)$ calculations, which is the same as other traditional similarities. The reverse and normalized variants only involve matrix transpose and normalization operations, thus the cost is of the same order. Since the calculation of the graph diffusion distance can be written in the matrix form (1), parallel computing is also straightforward to use if and when needed.

3 Metametrics and Quasi-Metametrics

In this section, we analyze the conditions under which the graph diffusion distance or its variants come close to a bona fide metric. We assume that the bipartite graph B is connected; otherwise the data set can be partitioned into groups of objects with completely different features. As it turns out, we do not actually need the full strength of a metric but only 1/2 of the key metric properties to judiciously separate points. In fact, some metric properties such as symmetry are even unnatural in our context: among an object A's neighbors, B is the most similar to A, but among the many more neighbors of B, A may not be that similar to B. From the four key metric properties, we wish to keep non-negativity and the triangle inequality to preserve a notion of neighborhood. It turns out the intersection of a quasimetric and a metametric would be sufficient for our purposes.

A quasimetric is a metric without the symmetry property, while a metametric does not require two objects with the same features to be identical. In order for a function $d(\cdot, \cdot)$ to be a metametric, d has to be non-negative and symmetric, $d(x, y) = 0$ has to imply x = y but not necessarily vice versa, and d has to satisfy the triangle inequality $d(x, y) + d(y, z) \ge d(x, z)$. Notice that in (3), when $p_{ii}$ is the same for all i, $g^{(1)}(i, j)$ becomes symmetric. In fact, the symmetry is inherited by $g^{(k)}(i, j)$, and we have

Theorem 3.1. The normalized graph diffusion distance of order k, namely $d_n^{(k)}(\cdot, \cdot)$, is a metametric. When applied to distributions or categorical data, the forward, reverse, and normalized graph diffusion distances become identical, and are all metametrics as well.

A quasi-metametric is a metametric without symmetry. A quasi-metametric captures the key asymmetric relations in object-feature data sets and provides a good neighborhood structure. Therefore, a quasi-metametric has only the necessary properties for a similarity to be useful. In the case when the $p_{ii}$ are different for different i, symmetry no longer holds, and thus a quasi-metametric is our best shot:

Theorem 3.2. Let P be the row-sum diagonal matrix for W. If $\min_i p_{ii} / \max_i p_{ii} > 2/3$, then both the forward graph diffusion distance $d_g^{(1)}(\cdot, \cdot)$ and the reverse graph diffusion distance $d_r^{(1)}(\cdot, \cdot)$ are quasi-metametrics.

Theorem 3.2 shows that, if the row-sums of the features are comparable, then the order 1 forward graph diffusion distance and its reverse version are quasi-metametrics. However, since the similarity vector $g^{(k)}(i, \cdot)$ eventually converges to the equilibrium vector of the graph G, the triangle inequality cannot hold for large k if the equilibrium vector itself does not satisfy the triangle inequality. On the other hand, the similarity vector $r^{(k)}(i, \cdot)$ converges to a constant vector, so its triangle inequality holds. Detailed proofs of Theorem 3.1 and Theorem 3.2 are left to the supplementary material. A quick numeric check of both theorems is sketched below.
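As a sanity check (ours, not part of the paper), the following sketch verifies numerically that the normalized diffusion similarity is symmetric (Theorem 3.1) and that, when the row-sum condition of Theorem 3.2 holds, the order-1 graph diffusion distance satisfies the triangle inequality:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
# Random non-negative features with comparable row sums, so that the
# Theorem 3.2 condition min_i p_ii / max_i p_ii > 2/3 holds.
W = rng.uniform(0.8, 1.2, size=(6, 10))
p = W.sum(axis=1)
assert p.min() / p.max() > 2 / 3

G1 = (W / p[:, None]) @ (W / W.sum(axis=0)).T   # forward g^(1), eq. (1)

Wn = W / p[:, None]                              # row-normalized features
N1 = Wn @ (Wn / Wn.sum(axis=0)).T                # normalized n^(1): P becomes I

# Theorem 3.1: the normalized diffusion similarity is symmetric.
assert np.allclose(N1, N1.T)

# Theorem 3.2: d_g^(1) = 1 - g^(1) satisfies the triangle inequality.
D = 1.0 - G1
for i, j, k in permutations(range(len(D)), 3):
    assert D[i, j] + D[j, k] >= D[i, k] - 1e-12
```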

4 Experiments on Structured Data

In this section, we evaluate the performance of graph diffusion similarities on structured data sets. Since similarity is used, explicitly or implicitly, in a wide spectrum of data science and machine learning studies, it is difficult to analyze and evaluate the newly proposed near-metrics thoroughly in all kinds of tasks and applications. Instead, we aim at first-principle experiments, in which we check whether two data points that are close according to a similarity share the same ground truth label. More specifically, let x be any chosen data point and y the corresponding label. To test the performance of a similarity measure S, we first rank all the data points with respect to their similarities to x. Then for any $0 < f \le 1$, we calculate the proportion of data points that hold different labels compared to y among the (nf)-nearest neighbors of x, which yields the error value $e_S(x, f)$ of data point x at f. The error curve is defined as the average error over all the data points:

$$E_S(f) := \frac{1}{n} \sum_{x} e_S(x, f). \qquad (4)$$

Notice that $E_S(1)$ does not depend on the similarity measure adopted. It is determined by the number of data points in each class. For example, if the data set contains two classes with equal numbers of data points, then $E_S(1) = 0.5$. We define $E_S(0) = 0$ for convenience. Naturally, the error curve $E_S(f)$ is expected to grow (but not necessarily) as f becomes larger. In the following experiments, we did encounter cases where this curve fluctuates, but most of the time we found it increases monotonically.

In Table 1, we demonstrate the performance of 10 existing similarity measures and the graph diffusion similarity measures of order 1 to 7, denoted by GD1 to GD7. The existing similarity measures include overlap, Eskin, IOF, OF, Lin, Goodall3, Goodall4, inner product, Euclidean, and cosine. We use reverse graph diffusion during the experiments. Recall that, when applied to categorical features, all graph diffusion similarities coincide. Among the tested data sets, the results of 11 are shown in Table 1. Nine of the shown data sets are from the UCI Machine Learning Repository [1]. LC and PR are loan-level data sets from the two largest P2P sites, Prosper and Lending Club [18, 24]. Because of the page limit, we are not able to demonstrate the error curves directly in this section; instead, we show the values of $E_S(0.01)$, $E_S(0.02)$, and $E_S(0.05)$ in Table 1, which correspond to the average errors at the 1%, 2%, and 5% nearest neighbor sets.
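The entries of Table 1 come from the error-curve protocol in (4). Below is a minimal sketch (our own, not the authors' code) of computing $E_S(f)$ from a precomputed similarity matrix; whether a point counts among its own neighbors is left open in the text, and we exclude it here.

```python
import numpy as np

def error_curve(sim, labels, fractions):
    """E_S(f) from eq. (4): for each point x, rank all points by similarity
    to x and measure the fraction of label mismatches among the top n*f."""
    labels = np.asarray(labels)
    n = len(labels)
    order = np.argsort(-sim, axis=1)     # neighbors by decreasing similarity
    curve = []
    for f in fractions:
        top = max(1, int(n * f))
        errs = []
        for x in range(n):
            neigh = [j for j in order[x] if j != x][:top]   # drop the point itself
            errs.append(np.mean(labels[neigh] != labels[x]))
        curve.append(np.mean(errs))      # average over all data points
    return np.array(curve)

# Usage with the diffusion similarity sketched earlier:
# E = error_curve(diffusion_similarity(W, k=1), labels, [0.01, 0.02, 0.05])
```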

Table 1: Evaluation of similarity measures on different data sets. The columns are overlap, Eskin, IOF, OF, Lin, Goodall3 (G3), Goodall4 (G4), inner product, l2, cosine, and graph diffusion of orders 1 to 7 (GD1-GD7); the rows are the Balance Scale, SPECT Heart, Lending Club, Nursery, Prosper, Tic-Tac-Toe, Car Evaluation, Chess, Hayes-Roth, Connect, and Solar Flare data sets. For each similarity S, the three rows give the values of $E_S(0.01)$, $E_S(0.02)$, and $E_S(0.05)$; for each data set, its object number n and attribute number m are listed below its name. [The numeric entries of the table did not survive extraction.]

We make several key observations as follows. Table 1 shows that no single similarity measure dominates all others, which is in accordance with the observation in [4] and the no free lunch theorem in optimization and machine learning [35]. The order 1 forward graph diffusion similarity $g^{(1)}(\cdot, \cdot)$ is among the best, while IOF, Lin, and Goodall3 also perform well on certain data sets. In addition, it can be observed that $g^{(1)}(\cdot, \cdot)$ usually performs best compared to its higher order versions. We will come back to this observation in the next section.

5 Experiments on Unstructured Data

In this section, we present experimental evaluations of similarity measures on vector representations of unstructured data. Unstructured data like text and images, despite their vast volume and importance, are less understood and more difficult to analyze. The major reason appears to be that unstructured data cannot be used directly for the tasks of search, clustering, or recommendation. To this end, various ways of transforming unstructured data into vector form have been proposed [20, 16]. For example, tf-idf converts a sentence or a document into a sparse, high-dimensional vector [31], and the word2vec model translates texts into dense vectors with semantics in low dimensions [20]. Similar approaches for data sets consisting of or including images are still awaiting detailed documentation. In this section, we focus on the experimental evaluation of similarity measures based on vector representations derived from tf-idf and deep learning. Our evaluation criterion remains the same as in Section 4: we use the error curve $E_S(f)$ for comparison. The experiments are conducted on the IMDB movie review dataset [19], the customer verbatim dataset, and the ImageNet dataset [8].

5.1 IMDB Movie Review Dataset

The IMDB dataset consists of 50,000 movie reviews in text form [19]. The length of a review varies from very short to more than 2,000 words. Each movie review is associated with a binary sentiment polarity label. There are 25,000 positive reviews and 25,000 negative reviews. Naturally, a good similarity measure should yield a higher similarity score for reviews holding the same label. We test the similarity measures on two kinds of representations: the tf-idf representation and a compact embedding from a convolutional neural network.

tf-idf representation. A direct way of translating a review into a feature vector is via the tf-idf representation [31]. In general, the importance of a word in a review is addressed by its frequency in that review multiplied by its inverse document frequency in the entire corpus.
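As a rough sketch of this representation (using scikit-learn's TfidfVectorizer; the paper does not specify which implementation was used), the reviews become rows of a sparse non-negative matrix that can serve directly as the feature matrix W:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["a gripping, beautifully acted film",
           "dull plot and wooden acting",
           "beautifully shot but the plot is dull"]

# Term frequency weighted by inverse document frequency; rows are reviews,
# columns are vocabulary terms, entries are non-negative weights.
vec = TfidfVectorizer()
W_tfidf = vec.fit_transform(reviews).toarray()   # dense here only for illustration

# W_tfidf is a valid object-feature matrix for the graph diffusion similarity.
```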

Even though this transformation risks over-simplification, it captures some basic frequency-related information of the texts and is widely adopted [25, 2]. In the left panel of Figure 2, we demonstrate the error curves of the forward graph diffusion similarity, its reverse and normalized variants, and several traditionally used similarity measures. It can be observed that the traditional measures, including Euclidean, Manhattan, inner product, and cosine, are considerably outperformed by the reverse and the normalized graph diffusion similarities.

Figure 2: Error curves for the tf-idf representation of the IMDB movie review dataset.

The members of the family of graph diffusion similarities also perform differently from one another. We demonstrate in the right panel of Figure 2 the reverse graph diffusion similarity measures of orders 1 to 7 as an example. It can be observed that the performance improves and later decreases as the order increases, and the order 4 and order 5 curves are the best among the 7.

Compact embedding via deep learning. In addition to the tf-idf transformation, a more recent way of embedding a review into a vector space is via deep learning. The deep convolutional neural network we use here is designed for sentiment analysis by Kim [14]. The input layer converts any incoming paragraph into a vector of undetermined length, then convolutional windows of different sizes further transform it into vectors of values. After that, a max pooling layer eliminates the varying length, so that the number of nodes equals the number of convolution windows. After the training session, the part of the neural network from the input layer to the last hidden layer itself becomes a function that maps any paragraph of text into a fixed-length vector, as sketched below.
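A minimal sketch of this use of a trained network as an embedding function follows, in PyTorch; the class, layer sizes, and window sizes are our own stand-ins rather than Kim's exact architecture, and the paper's 128-dimensional feature vector corresponds to the total number of kernels.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Toy stand-in for a Kim-style sentence CNN: embedding, one
    convolution per window size, max pooling over time, classifier."""
    def __init__(self, vocab=10000, dim=64, kernels=128, sizes=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(nn.Conv1d(dim, kernels, s) for s in sizes)
        self.fc = nn.Linear(kernels * len(sizes), 2)   # sentiment head

    def features(self, tokens):                        # tokens: (batch, seq)
        x = self.emb(tokens).transpose(1, 2)           # (batch, dim, seq)
        # Max pooling over time removes the varying sequence length.
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)                # fixed-length embedding

    def forward(self, tokens):
        return self.fc(self.features(tokens))

# After training, model.features(...) maps any paragraph to a fixed-length,
# non-negative (post-ReLU) vector usable as a row of the feature matrix W.
```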

In the following experiments, the vectors are from a CNN with 128 convolution kernels, and thus the feature vector consists of 128 dimensions. Similar results are observed for different sizes of CNN, as long as the number of kernels is not too small to lose track of the original information in the sentences.

In Figure 3, we show the optimal error curve for reference. The optimal error curve corresponds to the optimal similarity measure under which each review's neighbors always carry the same opinion label and the reviews holding different labels are far away from each other. Therefore the first half of the optimal error curve is 0, and it then gradually increases to 0.5. It can be observed in the left panel of Figure 3 that the reverse and normalized graph diffusion similarity measures clearly outperform the others. Furthermore, the right panel of Figure 3 shows that the order 2 normalized graph diffusion similarity almost coincides with the optimal curve.

Figure 3: Error curves for the compact representation of the IMDB movie review dataset.

An interesting phenomenon regarding graph diffusion similarities of different orders emerges from Figure 2 and Figure 3: when the order increases, the performance improves at first (first phase) up to a critical order, but then decreases and later becomes completely random (second phase). This second phase is natural because the graph diffusion similarity $g^{(k)}(i, \cdot)$ converges to the equilibrium vector of the graph, and the reverse and normalized versions converge to a constant vector, all of which are doomed to be poor. As for the first phase and its critical order, for the compact representation in Figure 3, the best performance is achieved at order 2, whereas for the sparse representation of tf-idf, the optimal order is 4 or 5. From the discussion in Section 2, we recall that the speed of information propagation in the graph is highly related to the sparsity of the feature matrix W. When W, or the graph, is too sparse, the similarity between a pair of objects can hardly be adequately quantified if the initial probability mass is not well diffused in the entire system. Thus, we need a higher order to achieve good performance for sparse data. The convergence behind the second phase is easy to see numerically, as sketched below.
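In the sketch (ours), every row of $S^k$ approaches the same equilibrium vector as k grows, so $g^{(k)}(i, \cdot)$ stops distinguishing objects:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.uniform(size=(5, 8))                       # dense random features
p_inv = 1.0 / W.sum(axis=1)
S = (p_inv[:, None] * W) @ (W / W.sum(axis=0)).T   # G^(1), eq. (1)

# As k grows, every row of S^k approaches the same equilibrium vector,
# so the column-wise spread of G^(k) shrinks toward 0 and the similarity
# homogenizes across objects.
for k in (1, 2, 5, 20, 100):
    Gk = np.linalg.matrix_power(S, k)
    print(k, np.ptp(Gk, axis=0).max())             # max row spread per column
```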

5.2 Customer Verbatim Dataset

The customer verbatim dataset is a private dataset containing 21,621 paragraphs, each being a review of a company and its competitors' products and services quantified by a collection of business customers. Each of the reviews is labeled according to whether it is in favor of the company's competitors, and whether the customer likes the company's main product. Again, we first use a CNN to do an embedding for the extraction of compact feature vectors. Eventually, each review is represented by a 128-dimensional vector of non-negative values. We calculate the error curves for both the opinion-of-competitor label and the opinion-of-product label. The curves are shown in Figure 4. We observe that the graph diffusion similarity family is able to achieve near optimal performance, while traditional similarities perform only in a mediocre way. As before, the performance of the graph diffusion similarity family improves to reach a near optimal state (at order 2 for opinion-of-competitor and order 2 or 3 for opinion-of-product) and then deteriorates. This again supports our conjecture that the optimal order for the graph diffusion similarity is positively correlated with the feature sparsity.

5.3 ImageNet Dataset

The performance of the graph diffusion similarities is also tested on the ImageNet dataset [8]. We adopt the GoogLeNet model trained on 1.2 million images for a classification problem with 1,000 classes [32]. Again, we extract the last hidden layer, which contains 1,024 hidden nodes of non-negative values for each image. For better visualization, we randomly pick r classes out of the 1,000 different classes and calculate the corresponding error curves. We repeat this 200 times and compute the average error curves (a sketch of this protocol follows below). The cases r = 2 and r = 5 are demonstrated in Figure 5.
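The averaging protocol can be sketched as follows, reusing the error_curve and diffusion_similarity helpers from the earlier sketches (our own illustration; features and labels stand for the extracted 1,024-dimensional activations and their class ids):

```python
import numpy as np

def averaged_error_curve(features, labels, r, fractions, repeats=200, seed=0):
    """Average E_S(f) over `repeats` random subsets of r classes.
    Assumes error_curve() and diffusion_similarity() defined above."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    curves = []
    for _ in range(repeats):
        chosen = rng.choice(classes, size=r, replace=False)
        mask = np.isin(labels, chosen)
        # Reverse/normalized variants are analogous (transpose / row-normalize).
        sim = diffusion_similarity(features[mask], k=2)
        curves.append(error_curve(sim, labels[mask], fractions))
    return np.mean(curves, axis=0)
```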

Figure 4: (Left) Error curves for the competitor label; (right) error curves for the product label.

Figure 5: Error curves of (upper) r = 2 and (lower) r = 5 for the GoogLeNet features on ImageNet.

Again, compared to traditional similarities, the graph diffusion similarity family performs better in terms of distinguishing different classes. For the case of r = 5, the order 2 curve performs best, while for the case of r = 2 we observe a different behavior: the curves for orders 2 to 6 are almost identical, and then suddenly the order 7 curve becomes the constant 0.5, which indicates the similarity has homogenized for every pair of objects.

6 Discussion and Future Work

A key observation in the last section was that when increasing the number of iterations, the performance of the graph diffusion similarity usually improves at first and then becomes worse. The optimal number of iterations for sparse representations like tf-idf is typically larger (4 to 5), and for dense embeddings it is smaller (2). For the structured data sets in Section 4, the same phenomenon is observed. We have observed a similar phenomenon in other data sets which, due to the page limit, we do not demonstrate here. Purely from the perspective of graph theory, object-feature sparsity means that it takes more iterations for mass to diffuse from one object to another, which makes the graph diffusion similarity less ideal with a small number of iterations. We conjecture that this phenomenon should be common for all kinds of data sets, and we call for future work on building an explicit relation between the sparsity of the feature vectors and the optimal number of iterations.

References

[1] Arthur Asuncion and David Newman. UCI Machine Learning Repository.

[2] Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal. Bridging the lexical chasm: statistical approaches to answer-finding. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

[3] Anil Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā: The Indian Journal of Statistics.

[4] Shyam Boriah, Varun Chandola, and Vipin Kumar. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining. SIAM.

[5] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3).

[6] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug).

[7] Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L. Massart. The Mahalanobis distance. Chemometrics and Intelligent Laboratory Systems, 50(1):1-18.

[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE.

[9] Usama Fayyad and Keki Irani. Multi-interval discretization of continuous-valued attributes for classification learning.

[10] Amir Gandomi and Murtaza Haider. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2).

[11] A. Ardeshir Goshtasby. Similarity and Dissimilarity Measures. Springer.

[12] Richard W. Hamming. Error detecting and error correcting codes. Bell Labs Technical Journal, 29(2).

[13] William P. Jones and George W. Furnas. Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society for Information Science, 38(6):420.

[14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint.

[15] Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.

[16] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, volume 14.

[17] Lucien Le Cam and Grace Lo Yang. Asymptotics in Statistics: Some Basic Concepts. Springer Science & Business Media.

[18] LendingClub. Lending Club Corporation.

[19] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA. Association for Computational Linguistics.

[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

[21] Dimitrios Milioris and Philippe Jacquet. Joint sequence complexity analysis: Application to social networks information flow. Bell Labs Technical Journal, 18(4):75-88.

[22] James R. Norris. Markov Chains. Number 2. Cambridge University Press.

[23] Pascal Pons and Matthieu Latapy. Computing communities in large networks using random walks. In International Symposium on Computer and Information Sciences. Springer.

[24] Prosper. Prosper Marketplace Inc.

[25] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning.

[26] Konrad Rieck. Similarity measures for sequential data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(4).

[27] Konrad Rieck and Pavel Laskov. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9(Jan):23-48.

[28] Konrad Rieck, Pavel Laskov, and Sören Sonnenburg. Computation of similarity measures for sequential data using generalized suffix trees. Advances in Neural Information Processing Systems, 19:1177.

[29] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4).

[30] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval.

[31] Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11).

[32] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9.

[33] Kai Ming Ting, Ye Zhu, Mark Carman, Yue Zhu, and Zhi-Hua Zhou. Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

[34] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

[35] David H. Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry. Springer.

Appendix: Proofs of Theorem 3.1 and Theorem 3.2

The detailed proofs of Theorem 3.1 and Theorem 3.2 are as follows. Recall that an explicit formula for $g^{(1)}(i, j)$ is

$$g^{(1)}(i, j) = \sum_{s=1}^{m} \frac{w_{is}}{w_{i1} + \cdots + w_{im}} \cdot \frac{w_{js}}{w_{1s} + \cdots + w_{ns}} = \frac{1}{p_{ii}} \sum_{s=1}^{m} \frac{w_{is} w_{js}}{q_{ss}}. \qquad (5)$$

Basic Properties

We begin with some basic properties of the graph diffusion similarity and its variants. It is clear that $0 \le d_g^{(k)}(i, j) = 1 - g^{(k)}(i, j) \le 1$, since $g^{(k)}(i, j)$ is a (transition) probability or transferred mass. Further, we have

Proposition 6.1. If $d_g^{(k)}(i, j) = 0$, then i = j and object i is isolated from the rest of the objects. If $d_g^{(k)}(i, j) = 1$ for every k, then object j cannot be reached from object i in G.

Proof. If $d_g^{(k)}(i, j) = 0$, then $g^{(k)}(i, j) = 1$, so all the probability measure at i is transferred to j after k rounds. If $i \ne j$, then there should be a feature s such that $w_{is} > 0$ in order for the probability starting at i to propagate to j via a path. However, $w_{is} > 0$ means that i is also connected to itself, and thus there will always be a positive probability at i, which implies $d_g^{(k)}(i, j) > 0$. Thus, by contradiction, i = j, and for any s such that $w_{is} > 0$ we have $w_{ls} = 0$ for $l \ne i$; hence i is isolated. On the other hand, if $d_g^{(k)}(i, j) = 1$ for all k, then zero probability is transferred from i to j in any number of steps, therefore there is no path from i to j in G.

Notice that $d_g^{(k)}(i, i)$ is not necessarily 0, due to dispersion of mass to other nodes through common features. As for symmetry of the graph diffusion similarity, it is clear from (5) that $g^{(1)}(i, j)$ is not symmetric in general. However, we have

Proposition 6.2. A sufficient condition for $G^{(1)}$ to be symmetric is that the $p_{ii}$ are the same for all i. If the bipartite graph is connected, then the condition that all the $p_{ii}$ are the same is also necessary for $G^{(1)}$ to be symmetric.

Proof. The first part is straightforward by (5). For the second part of the statement, notice that for any pair (i, j), there should be a connected path i, k, ..., j, and i being connected to k means $w_{i1} w_{k1} + \cdots + w_{im} w_{km} > 0$, and

thus $p_{ii} = p_{kk}$ according to (5). The same argument holds for any consecutive objects in the path between i and j, which leads to the conclusion that $p_{ii} = p_{jj}$.

The Triangle Inequality

In this section, we assume $p_{ii}$ is a constant for all i, which could be achieved via scaling. We will show that this condition ensures the resulting graph diffusion distance is a metametric. It is clear that the normalized graph diffusion distance satisfies this condition, since it normalizes the feature weights before calculating similarity. Further, for distributional data and categorical data this condition always holds, since for distributions $p_{ii} = 1$, and for categorical data $p_{ii}$ equals the number of categories. Therefore, the forward, reverse, and normalized variants of the graph diffusion distance are identical when applied to distributions or categorical data. In the following analysis, we will use $d_n^{(k)}(\cdot, \cdot)$ for concreteness, and the proof for distributions and categorical data directly follows.

We first consider the case when the number of iterations is 1. In this case, we will prove:

Proposition 6.3. For any row-stochastic matrix W and its column-sum diagonal matrix Q, define the matrix $D = (d_{ij})$ as $D := \mathbf{1}\mathbf{1}^{\top} - W Q^{-1} W^{\top}$. Then D is a symmetric matrix, and the triangle inequality $d_{ij} + d_{jk} \ge d_{ik}$ holds for any $1 \le i, j, k \le n$.

Proof. Based on the definition of D, we have

$$d_{ij} = 1 - \sum_{k=1}^{m} \frac{w_{ik} w_{jk}}{q_{kk}}, \qquad (6)$$

and thus the symmetry of D follows immediately. For the triangle inequality, without loss of generality, we only need to prove that $d_{12} + d_{23} \ge d_{13}$, which is expanded as

$$\sum_{k=1}^{m} \frac{(w_{1k} + w_{3k}) w_{2k} - w_{1k} w_{3k}}{q_{kk}} \le 1. \qquad (7)$$

Notice that $w_{1k} w_{3k} \ge 0$ and $q_{kk} \ge w_{1k} + w_{2k} + w_{3k}$; then it is sufficient to prove that

$$\sum_{k=1}^{m} \frac{(w_{1k} + w_{3k}) w_{2k}}{w_{1k} + w_{2k} + w_{3k}} \le 1.$$

It is easy to check that $\frac{xy}{x+y} \le \frac{1}{9}x + \frac{4}{9}y$ holds for any x and y with $x + y > 0$, since it is equivalent to $(x - 2y)^2 \ge 0$. By letting $x = w_{1k} + w_{3k}$ and $y = w_{2k}$, we have

$$\sum_{k=1}^{m} \frac{(w_{1k} + w_{3k}) w_{2k}}{w_{1k} + w_{2k} + w_{3k}} \le \frac{1}{9} \sum_{k=1}^{m} (w_{1k} + w_{3k}) + \frac{4}{9} \sum_{k=1}^{m} w_{2k} = \frac{2}{3} < 1,$$

which completes the proof of the triangle inequality, and thus the proposition.

We note that the coefficient 2/3 in the above is tight, in the sense that there exists a construction of W such that all the above inequalities become equalities. The construction is as follows. Let $w_{11} = 1$, $w_{12} = 0$, $w_{21} = w_{22} = 1/2$, $w_{31} = 0$, $w_{32} = 1$. Let $w_{1k} = w_{2k} = w_{3k} = 0$ for $k \ge 3$. For all $r \ge 4$, let $w_{r1} = w_{r2} = 0$, and set $w_{rk}$ to any non-negative values such that $w_{r3} + \cdots + w_{rm} = 1$. It is clear that W under this construction is row-stochastic. Besides, $g^{(1)}(1, 2) = g^{(1)}(2, 3) = 1/3$ and $g^{(1)}(1, 3) = 0$, and hence $g^{(1)}(1, 2) + g^{(1)}(2, 3) - g^{(1)}(1, 3) = 2/3$.

Next we consider the triangle inequality for general order r. We will prove:

Proposition 6.4. For any row-stochastic matrix W with its column-sum diagonal matrix Q, define the matrix $D^{(r)} = (d^{(r)}_{ij})$ as $D^{(r)} := \mathbf{1}\mathbf{1}^{\top} - (W Q^{-1} W^{\top})^r$. Then the triangle inequality $d^{(r)}_{ij} + d^{(r)}_{jk} \ge d^{(r)}_{ik}$ holds for any $1 \le i, j, k \le n$ and any positive integer r.

Proof. The statement for r = 1 is proved in Proposition 6.3. We consider the case r = 2u (an even number) and the case r = 2u + 1 (an odd number) separately. For the case r = 2u, denote $\widetilde{W} = (W Q^{-1} W^{\top})^u$; then $\widetilde{W}$ is symmetric. Since both W and $Q^{-1} W^{\top}$ are row-stochastic, $\widetilde{W}$ is row-stochastic as well, and being symmetric it is actually doubly-stochastic, with $D^{(r)} = \mathbf{1}\mathbf{1}^{\top} - \widetilde{W} \widetilde{W}^{\top}$. Since $\widetilde{W}$ is doubly-stochastic, the corresponding column-sum diagonal matrix $\widetilde{Q}$

for $\widetilde{W}$ becomes an identity matrix. By regarding $\widetilde{W}$ as a new feature matrix in Proposition 6.3, the triangle inequality follows immediately.

For the case r = 2u + 1, let $\widehat{W} = \widetilde{W} W$; then

$$D^{(r)} = \mathbf{1}\mathbf{1}^{\top} - \widetilde{W} W Q^{-1} W^{\top} \widetilde{W}^{\top} = \mathbf{1}\mathbf{1}^{\top} - \widehat{W} Q^{-1} \widehat{W}^{\top}. \qquad (8)$$

Recalling Proposition 6.3, we only need to show that $\widehat{W}$ is a row-stochastic matrix and that Q is its column-sum matrix. Since both $\widetilde{W}$ and W are row-stochastic matrices, so is $\widehat{W} = \widetilde{W} W$. For its column sums, since $\widetilde{W}$ is doubly-stochastic, $\widehat{W}^{\top} \mathbf{1} = W^{\top} \widetilde{W}^{\top} \mathbf{1} = W^{\top} \mathbf{1}$; thus $\widehat{W}$ and W share the same column-sum matrix Q, which completes the proof.

Theorem 3.1 follows directly from Proposition 6.1, Proposition 6.2, Proposition 6.3, and Proposition 6.4.

Analysis of $d_g^{(k)}(\cdot, \cdot)$ and $d_r^{(k)}(\cdot, \cdot)$

For the forward graph diffusion distance $d_g^{(k)}(\cdot, \cdot)$ and its reverse version $d_r^{(k)}(\cdot, \cdot)$, symmetry is no longer guaranteed. Besides, the triangle inequality need not hold in general. We aim to find sufficient conditions for these distances to be at least quasi-metametrics. A counterexample to the triangle inequality is provided by these 3 objects and 2 features: W = [1, 0; 2, 6; 0, 12]. It is straightforward to check that $d_g^{(1)}(1, 2) + d_g^{(1)}(2, 3) - d_g^{(1)}(1, 3) = 1/3 + 1/2 - 1 < 0$, and also $d_r^{(1)}(3, 2) + d_r^{(1)}(2, 1) - d_r^{(1)}(3, 1) = 1/2 + 1/3 - 1 < 0$. The reason for the failure of the triangle inequality is that different objects have distinct total feature sums $p_{ii}$.

Now we are in a position to prove Theorem 3.2:

Proof. Based on the discussion above, we only need to prove the triangle inequality. Again, without loss of generality, we only need to prove $d_g^{(1)}(1, 2) + d_g^{(1)}(2, 3) \ge d_g^{(1)}(1, 3)$. Following the proof of Proposition 6.3, it suffices to show that

$$\sum_{k=1}^{m} \frac{(w_{1k}/p_{11} + w_{3k}/p_{22}) w_{2k} - w_{1k} w_{3k}/p_{11}}{q_{kk}} \le 1. \qquad (9)$$

Following the same argument as in the proof of Proposition 6.3, we have

$$\sum_{k=1}^{m} \frac{(w_{1k}/p_{11} + w_{3k}/p_{22}) w_{2k} - w_{1k} w_{3k}/p_{11}}{q_{kk}} \le \sum_{k=1}^{m} \frac{(w_{1k}/p_{11} + w_{3k}/p_{22}) w_{2k}}{w_{1k} + w_{2k} + w_{3k}} \le \frac{2}{3} \cdot \frac{\max_i p_{ii}}{\min_i p_{ii}} < 1,$$

which completes the proof of Theorem 3.2.
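To close, a short numeric check (ours, not part of the paper) of the counterexample and of the tightness construction in Proposition 6.3:

```python
import numpy as np

def g1(W):
    # Order-1 diffusion similarity, eq. (5).
    return (W / W.sum(axis=1, keepdims=True)) @ (W / W.sum(axis=0)).T

# Counterexample: distinct row sums break the triangle inequality.
W = np.array([[1.0, 0.0], [2.0, 6.0], [0.0, 12.0]])
D = 1.0 - g1(W)
print(D[0, 1] + D[1, 2] - D[0, 2])        # -1/6 < 0

# Tightness of the 2/3 coefficient in Proposition 6.3.
Wt = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # row-stochastic
G = g1(Wt)
print(G[0, 1] + G[1, 2] - G[0, 2])        # exactly 2/3
```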


LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Calculus of Variations

Calculus of Variations Calculus of Variations Lagrangian formalism is the main tool of theoretical classical mechanics. Calculus of Variations is a part of Mathematics which Lagrangian formalism is base on. In this section,

More information

Generalization of the persistent random walk to dimensions greater than 1

Generalization of the persistent random walk to dimensions greater than 1 PHYSICAL REVIEW E VOLUME 58, NUMBER 6 DECEMBER 1998 Generalization of the persistent ranom walk to imensions greater than 1 Marián Boguñá, Josep M. Porrà, an Jaume Masoliver Departament e Física Fonamental,

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential

A Note on Exact Solutions to Linear Differential Equations by the Matrix Exponential Avances in Applie Mathematics an Mechanics Av. Appl. Math. Mech. Vol. 1 No. 4 pp. 573-580 DOI: 10.4208/aamm.09-m0946 August 2009 A Note on Exact Solutions to Linear Differential Equations by the Matrix

More information

THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP

THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP THE GENUINE OMEGA-REGULAR UNITARY DUAL OF THE METAPLECTIC GROUP ALESSANDRA PANTANO, ANNEGRET PAUL, AND SUSANA A. SALAMANCA-RIBA Abstract. We classify all genuine unitary representations of the metaplectic

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Lagrangian and Hamiltonian Mechanics

Lagrangian and Hamiltonian Mechanics Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

Markov Chains in Continuous Time

Markov Chains in Continuous Time Chapter 23 Markov Chains in Continuous Time Previously we looke at Markov chains, where the transitions betweenstatesoccurreatspecifietime- steps. That it, we mae time (a continuous variable) avance in

More information

A Novel Decoupled Iterative Method for Deep-Submicron MOSFET RF Circuit Simulation

A Novel Decoupled Iterative Method for Deep-Submicron MOSFET RF Circuit Simulation A Novel ecouple Iterative Metho for eep-submicron MOSFET RF Circuit Simulation CHUAN-SHENG WANG an YIMING LI epartment of Mathematics, National Tsing Hua University, National Nano evice Laboratories, an

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Fast image compression using matrix K-L transform

Fast image compression using matrix K-L transform Fast image compression using matrix K-L transform Daoqiang Zhang, Songcan Chen * Department of Computer Science an Engineering, Naning University of Aeronautics & Astronautics, Naning 2006, P.R. China.

More information

Computing Derivatives

Computing Derivatives Chapter 2 Computing Derivatives 2.1 Elementary erivative rules Motivating Questions In this section, we strive to unerstan the ieas generate by the following important questions: What are alternate notations

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

Lecture 6: Calculus. In Song Kim. September 7, 2011

Lecture 6: Calculus. In Song Kim. September 7, 2011 Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear

More information

Calculus in the AP Physics C Course The Derivative

Calculus in the AP Physics C Course The Derivative Limits an Derivatives Calculus in the AP Physics C Course The Derivative In physics, the ieas of the rate change of a quantity (along with the slope of a tangent line) an the area uner a curve are essential.

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

Variable Independence and Resolution Paths for Quantified Boolean Formulas

Variable Independence and Resolution Paths for Quantified Boolean Formulas Variable Inepenence an Resolution Paths for Quantifie Boolean Formulas Allen Van Geler http://www.cse.ucsc.eu/ avg University of California, Santa Cruz Abstract. Variable inepenence in quantifie boolean

More information

Vectors in two dimensions

Vectors in two dimensions Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10 Some vector algebra an the generalize chain rule Ross Bannister Data Assimilation Research Centre University of Reaing UK Last upate 10/06/10 1. Introuction an notation As we shall see in these notes the

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

Lecture 10 Notes, Electromagnetic Theory II Dr. Christopher S. Baird, faculty.uml.edu/cbaird University of Massachusetts Lowell

Lecture 10 Notes, Electromagnetic Theory II Dr. Christopher S. Baird, faculty.uml.edu/cbaird University of Massachusetts Lowell Lecture 10 Notes, Electromagnetic Theory II Dr. Christopher S. Bair, faculty.uml.eu/cbair University of Massachusetts Lowell 1. Pre-Einstein Relativity - Einstein i not invent the concept of relativity,

More information

Limitations of One-Hidden-Layer Perceptron Networks

Limitations of One-Hidden-Layer Perceptron Networks J. Yaghob (E.): ITAT 2015 pp. 167 171 Charles University in Prague, Prague, 2015 Limitations of One-Hien-Layer Perceptron Networks Věra Kůrková Institute of Computer Science, Czech Acaemy of Sciences,

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

What s in an Attribute? Consequences for the Least Common Subsumer

What s in an Attribute? Consequences for the Least Common Subsumer What s in an Attribute? Consequences for the Least Common Subsumer Ralf Küsters LuFG Theoretical Computer Science RWTH Aachen Ahornstraße 55 52074 Aachen Germany kuesters@informatik.rwth-aachen.e Alex

More information

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation

More information

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016 Noname manuscript No. (will be inserte by the eitor) Scaling properties of the number of ranom sequential asorption iterations neee to generate saturate ranom packing arxiv:607.06668v2 [con-mat.stat-mech]

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

arxiv: v2 [math.st] 29 Oct 2015

arxiv: v2 [math.st] 29 Oct 2015 EXPONENTIAL RANDOM SIMPLICIAL COMPLEXES KONSTANTIN ZUEV, OR EISENBERG, AND DMITRI KRIOUKOV arxiv:1502.05032v2 [math.st] 29 Oct 2015 Abstract. Exponential ranom graph moels have attracte significant research

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers International Journal of Statistics an Probability; Vol 6, No 5; September 207 ISSN 927-7032 E-ISSN 927-7040 Publishe by Canaian Center of Science an Eucation Improving Estimation Accuracy in Nonranomize

More information

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion

Hybrid Fusion for Biometrics: Combining Score-level and Decision-level Fusion Hybri Fusion for Biometrics: Combining Score-level an Decision-level Fusion Qian Tao Raymon Velhuis Signals an Systems Group, University of Twente Postbus 217, 7500AE Enschee, the Netherlans {q.tao,r.n.j.velhuis}@ewi.utwente.nl

More information

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics

Axiometrics: Axioms of Information Retrieval Effectiveness Metrics Axiometrics: Axioms of Information Retrieval Effectiveness Metrics ABSTRACT Ey Maalena Department of Maths Computer Science University of Uine Uine, Italy ey.maalena@uniu.it There are literally ozens most

More information

Counting Lattice Points in Polytopes: The Ehrhart Theory

Counting Lattice Points in Polytopes: The Ehrhart Theory 3 Counting Lattice Points in Polytopes: The Ehrhart Theory Ubi materia, ibi geometria. Johannes Kepler (1571 1630) Given the profusion of examples that gave rise to the polynomial behavior of the integer-point

More information

Stable and compact finite difference schemes

Stable and compact finite difference schemes Center for Turbulence Research Annual Research Briefs 2006 2 Stable an compact finite ifference schemes By K. Mattsson, M. Svär AND M. Shoeybi. Motivation an objectives Compact secon erivatives have long

More information

Two formulas for the Euler ϕ-function

Two formulas for the Euler ϕ-function Two formulas for the Euler ϕ-function Robert Frieman A multiplication formula for ϕ(n) The first formula we want to prove is the following: Theorem 1. If n 1 an n 2 are relatively prime positive integers,

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

The canonical controllers and regular interconnection

The canonical controllers and regular interconnection Systems & Control Letters ( www.elsevier.com/locate/sysconle The canonical controllers an regular interconnection A.A. Julius a,, J.C. Willems b, M.N. Belur c, H.L. Trentelman a Department of Applie Mathematics,

More information

Crowdsourced Judgement Elicitation with Endogenous Proficiency

Crowdsourced Judgement Elicitation with Endogenous Proficiency Crowsource Jugement Elicitation with Enogenous Proficiency Anirban Dasgupta Arpita Ghosh Abstract Crowsourcing is now wiely use to replace jugement or evaluation by an expert authority with an aggregate

More information

Short Intro to Coordinate Transformation

Short Intro to Coordinate Transformation Short Intro to Coorinate Transformation 1 A Vector A vector can basically be seen as an arrow in space pointing in a specific irection with a specific length. The following problem arises: How o we represent

More information

arxiv: v1 [math.mg] 10 Apr 2018

arxiv: v1 [math.mg] 10 Apr 2018 ON THE VOLUME BOUND IN THE DVORETZKY ROGERS LEMMA FERENC FODOR, MÁRTON NASZÓDI, AND TAMÁS ZARNÓCZ arxiv:1804.03444v1 [math.mg] 10 Apr 2018 Abstract. The classical Dvoretzky Rogers lemma provies a eterministic

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

In the usual geometric derivation of Bragg s Law one assumes that crystalline

In the usual geometric derivation of Bragg s Law one assumes that crystalline Diffraction Principles In the usual geometric erivation of ragg s Law one assumes that crystalline arrays of atoms iffract X-rays just as the regularly etche lines of a grating iffract light. While this

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information