Computational Statistics and Data Analysis 56 (2012) 2486-2500
doi:10.1016/j.csda.2012.02.002

Mixtures of weighted distance-based models for ranking data with applications in political studies

Paul H. Lee, Philip L.H. Yu
Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong
Corresponding author. Tel.: +852 96846294. E-mail addresses: honglee@graduate.hku.hk (P.H. Lee), plhyu@hku.hk (P.L.H. Yu).

Article history: Received 17 August 2010; received in revised form 30 January 2012; accepted 2 February 2012; available online 15 February 2012.

Keywords: Ranking data; Distance-based models; Mixture models

Abstract

Analysis of ranking data is often required in various fields of study, for example politics, market research and psychology. Over the years, many statistical models for ranking data have been developed. Among them, distance-based ranking models postulate that the probability of observing a ranking of items depends on the distance between the observed ranking and a modal ranking: the closer to the modal ranking, the higher the ranking probability. However, such a model assumes a homogeneous population, and the single dispersion parameter in the model may not be able to describe the data well. To overcome these limitations, we formulate more flexible models by considering the recently developed weighted distance-based models, which allow different weights for different ranks. The assumption of a homogeneous population can be relaxed by an extension to mixtures of weighted distance-based models. The properties of weighted distance-based models are also discussed. We carry out simulations to test the performance of our parameter estimation and model selection procedures. Finally, we apply the proposed methodology to analyze synthetic ranking datasets and a real-world ranking dataset about political goals priority.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Ranking data frequently occur when judges (or individuals) are asked to rank a set of items, which may be political goals, candidates in an election, types of soft drinks, etc. By studying ranking data, we can understand judges' perceptions of, and preferences over, the ranked alternatives. Analysis of ranking data is thus often required in various fields of study, such as politics, market research and psychology. Over the years, various statistical models for ranking data have been developed, including order statistics models, rankings induced by paired comparisons, distance-based models and multistage models. See Critchlow et al. (1991) and Marden (1995) for more details of these models.

Among the many models for ranking data, distance-based models have the advantage of being simple and elegant. Distance-based models (Fligner and Verducci, 1986) assume a modal ranking π₀, and the probability of observing a ranking π is inversely proportional to its distance from the modal ranking: the closer π is to π₀, the more frequently it is observed. There are different measures of distance between two rankings; some examples are the Kendall, Spearman and Cayley distances (see Mallows, 1957; Critchlow, 1985; Diaconis, 1988; Spearman, 1904). Distance-based models have received much less attention than they deserve, probably because the models are not flexible enough. With the aim of increasing model flexibility, Fligner and Verducci (1986) generalized the one-parameter distance-based models to (k − 1)-parameter models, based on the decomposition of a distance measure.
However, the symmetry property of the generalized distance measure is lost in their models, and the (k − 1)-parameter models do not belong to the class of distance-based models. In view of extending the class of distance-based models, Lee and Yu (2010) proposed using weighted distance measures in the models, which allow different weights for differently ranked items. In this way, the properties of a distance can be retained and, at the same time, the model flexibility is enhanced. However, Lee and Yu (2010) did not study the properties of these new ranking models, which will be considered in this paper.

Distance-based models assume a homogeneous population, and this is an important limitation. In the case of heterogeneous data, one can adopt a mixture modeling framework to produce more sophisticated models. The EM algorithm (Dempster et al., 1977) can fit the mixture models in a simple and fast manner. Mixture models for ranking data are not new and have been studied extensively, for example by Gormley and Murphy (2006, 2008), Croon (1989), Stern (1993) and Moors and Vermunt (2007). However, these studies were not on distance-based models. Recently, Murphy and Martin (2003) and Meilă and Chen (2010) extended the use of mixture models to distance-based models and (k − 1)-parameter models respectively, to describe the presence of heterogeneity among judges. In this way, the limitation of the assumption of a homogeneous population in distance-based models can be relaxed. Inspired by the results of the aforementioned research, we develop mixtures of weighted distance-based models for ranking data in this paper.

The remainder of this paper is organized as follows. In Section 2, we review the distance-based models for ranking data, and in Section 3 we explore the properties of the weighted distance-based models. The newly proposed mixtures of weighted distance-based models are explained in Section 4. Simulation studies are carried out to assess the performance of the model fitting and selection algorithm, and the results are presented in Section 5. To illustrate the feasibility of the proposed model, studies of synthetic ranking datasets and a social science ranking dataset on political goals are presented in Section 6. Finally, concluding remarks are given in Section 7.

2. Distance-based models for ranking data

2.1. Distance-based models

For a better description of ranking data, some notation must be defined first. In ranking k items, labeled 1, ..., k, a ranking π is a mapping from {1, ..., k} to {1, ..., k}, where π(i) is the rank given to item i. For example, π(2) = 3 means that item 2 is ranked third. A distance function is useful in measuring the discrepancy between two rankings. The usual properties of a distance function are: (1) d(π, π) = 0; (2) d(π, σ) > 0 if π ≠ σ; and (3) d(π, σ) = d(σ, π). For ranking data we require that the distance, apart from having these usual properties, be right invariant, i.e. d(π, σ) = d(π∘τ, σ∘τ), where π∘τ(i) = π(τ(i)). This requirement ensures that relabeling of the items has no effect on the distance. Some popular distances are given in Table 1, where I{·} is an indicator function. Apart from these distances, there are other distances for ranking data; readers can refer to Critchlow et al. (1991) for details.

Table 1
Some distances for ranking data.

Name                     Short form    Formula
Spearman's rho           R(π, σ)       [Σ_{i=1}^{k} (π(i) − σ(i))²]^{0.5}
Spearman's rho square    R²(π, σ)      Σ_{i=1}^{k} (π(i) − σ(i))²
Spearman's footrule      F(π, σ)       Σ_{i=1}^{k} |π(i) − σ(i)|
Kendall's tau            T(π, σ)       Σ_{i<j} I{[π(i) − π(j)][σ(i) − σ(j)] < 0}
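To make these definitions concrete, here is a minimal Python sketch of the four distances in Table 1. It is our own illustration rather than code from the paper, and it assumes a ranking is stored as a tuple whose i-th entry is the rank given to item i + 1.

```python
# Illustrative implementations of the distances in Table 1.
# A ranking pi is a tuple with pi[i] = rank assigned to item i + 1.

def spearman_rho(pi, sigma):
    return sum((p - s) ** 2 for p, s in zip(pi, sigma)) ** 0.5

def spearman_rho_square(pi, sigma):
    return sum((p - s) ** 2 for p, s in zip(pi, sigma))

def spearman_footrule(pi, sigma):
    return sum(abs(p - s) for p, s in zip(pi, sigma))

def kendall_tau(pi, sigma):
    # number of item pairs ranked in opposite relative order by pi and sigma
    k = len(pi)
    return sum(1 for i in range(k) for j in range(i + 1, k)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

pi = (1, 3, 2, 4)      # item 2 is ranked third, item 3 second
sigma = (1, 2, 3, 4)   # the identity ranking
print(kendall_tau(pi, sigma))        # 1: only the pair (item 2, item 3) disagrees
print(spearman_footrule(pi, sigma))  # 2
```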
Diaconis (1988) developed a class of distance-based models,

P(π | λ, π₀) = exp(−λ d(π, π₀)) / C(λ),

where λ ≥ 0 is the dispersion parameter, d(π, π₀) is an arbitrary right-invariant distance, and C(λ) is the proportionality constant. In the particular case where d is Kendall's tau, the model is named the Mallows φ-model (Mallows, 1957). The parameter λ measures how individuals' preferences differ from the modal ranking π₀: the probability of observing a ranking π drops as π moves away from π₀, and when λ approaches zero the distribution of rankings becomes uniform.

2.2. Weighted distance-based models

In this subsection we describe in detail the weighted distance-based models proposed in Lee and Yu (2010). Motivated by the weighted Kendall's tau correlation coefficient proposed by Shieh (1998), Lee and Yu (2010) defined some weighted distances, which are given in Table 2, where w_{π₀(i)} is the weight assigned to item i. Note that the weights are assigned according to the modal ranking π₀, i.e. w₁ is the weight assigned to the item ranked first in π₀. Introducing weights allows different penalties for different mistakenly ranked items, and hence increases the flexibility of the distance-based model. Apart from the weighted Kendall's tau (Shieh, 1998) and the weighted Spearman's rho square (Shieh et al., 2000), many other weighted rank correlations have been proposed; see, for example, Tarsitano (2009).

Table 2
Some weighted distances for ranking data.

Name                              Short form         Formula
Weighted Kendall's tau            T_w(π, σ | π₀)     Σ_{i<j} w_{π₀(i)} w_{π₀(j)} I{[π(i) − π(j)][σ(i) − σ(j)] < 0}
Weighted Spearman's rho           R_w(π, σ | π₀)     [Σ_{i=1}^{k} w_{π₀(i)} (π(i) − σ(i))²]^{0.5}
Weighted Spearman's rho square    R²_w(π, σ | π₀)    Σ_{i=1}^{k} w_{π₀(i)} (π(i) − σ(i))²
Weighted Spearman's footrule      F_w(π, σ | π₀)     Σ_{i=1}^{k} w_{π₀(i)} |π(i) − σ(i)|

In addition to the properties of a distance function explained in Section 2.1, another well-studied property is the triangle inequality, i.e. d(π_a, π_c) ≤ d(π_a, π_b) + d(π_b, π_c). A distance that satisfies the triangle inequality is called a metric. Some of the metrics used to measure the distance between two rankings are T, R and F. The weighted versions of these metrics, i.e. T_w, R_w and F_w, also satisfy the triangle inequality, and hence they are metrics too; the proof is given in Appendix A. Metrics may have an advantage over non-metric distances in modeling ranking data, as the rankings can be visualized as points in a metric space where the length between any two points determines the metric distance between the two associated rankings.

Applying a weighted distance measure d_w to the distance-based model, the probability of observing a ranking π under the weighted distance-based ranking model is

P(π | w, π₀) = exp(−d_w(π, π₀ | π₀)) / C(w),

where C(w) is the proportionality constant. Generally speaking, if w_i is large, few people will tend to disagree that the item ranked i in π₀ should be ranked i, because such disagreement greatly increases the distance and hence the probability of observing it becomes very small. If w_i is close to zero, people have little or no preference on how the item ranked i in π₀ is ranked, because a change of its rank does not affect the distance at all.

An illustration of the effects of w is given in Table 3. For weighted distance-based models with four items, five scenarios are considered, and the corresponding classification rates of the correctly ranked positions, based on the true ranking probabilities (with respect to π₀ = (1, 2, 3, 4)), are given for each item. Items with weight 2 are classified to their rank positions correctly with high probability, whereas items with weight 0.01 are classified correctly with probabilities comparable to random guesses.

Table 3
Illustration of the effects of w.

Model parameters           Correct classification rate of the ranked positions
w₁     w₂     w₃     w₄    Item 1   Item 2   Item 3   Item 4
2      2      2      2     0.981    0.963    0.963    0.981
2      2      2      0.01  0.974    0.931    0.852    0.848
2      2      0.01   0.01  0.949    0.840    0.433    0.495
2      0.01   0.01   0.01  0.866    0.300    0.330    0.338
0.01   0.01   0.01   0.01  0.256    0.253    0.253    0.256

Very often, the modal ranking π₀ is unknown. If the researchers have a clear a priori idea of how the ranks in π₀ should be weighted, they can fix the weights and simply estimate π₀ from the data. Otherwise, by estimating π₀, the weight w is implicitly estimated as well, and therefore the distance d_w(π, π₀ | π₀) changes through the computations; in other words, the weighted distance is essentially driven by the data. This may produce a more flexible weighted distance-based model than the (equally weighted) distance-based model.

3. Properties of weighted distance-based models

3.1. Relationship between weighted distance-based models and (k − 1)-parameter models

Fligner and Verducci (1986) showed that Kendall's tau can be decomposed into (k − 1) independent metrics:

T(π, π₀) = Σ_{π₀(i)=1}^{k−1} V_{π₀(i)},

where

V_{π₀(i)} = Σ_{π₀(j)=π₀(i)+1}^{k} I{[π(i) − π(j)][π₀(i) − π₀(j)] < 0}.
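The following sketch (ours, with illustrative weights) turns the weighted Kendall's tau of Table 2 and the model probability P(π | w, π₀) into code, computing C(w) by brute-force enumeration of all k! rankings, which is feasible only for small k.

```python
import math
from itertools import permutations

def weighted_kendall_tau(pi, sigma, pi0, w):
    # T_w of Table 2: a discordant item pair (i, j) contributes
    # w_{pi0(i)} * w_{pi0(j)}; w[r - 1] is the weight of rank r in pi0.
    k = len(pi)
    return sum(w[pi0[i] - 1] * w[pi0[j] - 1]
               for i in range(k) for j in range(i + 1, k)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def model_prob(pi, pi0, w):
    # P(pi | w, pi0), with the proportionality constant C(w) obtained by
    # summing exp(-d_w) over all k! rankings (small k only).
    rankings = list(permutations(range(1, len(pi0) + 1)))
    c = sum(math.exp(-weighted_kendall_tau(s, pi0, pi0, w)) for s in rankings)
    return math.exp(-weighted_kendall_tau(pi, pi0, pi0, w)) / c

pi0 = (1, 2, 3, 4)
w = (2.0, 1.5, 1.0, 0.5)
print(sum(model_prob(p, pi0, w) for p in permutations(range(1, 5))))  # ~1.0
print(model_prob(pi0, pi0, w))  # the modal ranking has the highest probability
```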

Among the four distance functions stated in Section 2.1, only Kendall's tau has such a multistage representation. By applying a dispersion parameter λ_{π₀(i)} to stage V_{π₀(i)}, the Mallows φ-model is extended to

P(π | λ, π₀) = exp(−Σ_{π₀(i)=1}^{k−1} λ_{π₀(i)} V_{π₀(i)}) / C(λ),

where λ = {λ_i, i = 1, ..., k − 1} and C(λ) is the proportionality constant. These models were named (k − 1)-parameter models in Fligner and Verducci (1986). Mallows φ-models are special cases of the φ-component models when λ₁ = ⋯ = λ_{k−1}. The (k − 1)-parameter models belong to the class of multistage models (Fligner and Verducci, 1988), but do not belong to the class of distance-based models, as the symmetry property of a distance is lost. Notice that the so-called distance between rankings π and σ in (k − 1)-parameter models can be expressed as

Σ_{π₀(i)<π₀(j)} λ_{π₀(i)} I{[π(i) − π(j)][σ(i) − σ(j)] < 0},

which is obviously not symmetric in π and σ, and hence is not a proper distance measure. The weighted tau models proposed in Lee and Yu (2010) are distance-based models, because the weighted tau function is a proper distance function. Furthermore, the weighted tau models retain the multistage nature of the (k − 1)-parameter models: under a weighted tau model, the ranking can be decomposed into (k − 1) stages V′₁, ..., V′_{k−1} via

T_w(π, π₀ | π₀) = Σ_{π₀(i)=1}^{k−1} w_{π₀(i)} V′_{π₀(i)},

where V′_{π₀(i)} equals

V′_{π₀(i)} = Σ_{π₀(j)=π₀(i)+1}^{k} w_{π₀(j)} I{[π(i) − π(j)][π₀(i) − π₀(j)] < 0}.

In principle, if the distance used in a distance-based ranking model does not satisfy the symmetry property, the model is still a ranking model, but it does not belong to the class of distance-based ranking models. In addition to the symmetry property of a distance, Hennig and Hausdorf (2006) commented that the invariance property should be a concern as well.
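The multistage identity T_w = Σ_r w_r V′_r from this subsection can be verified numerically. The sketch below (ours; the weights are illustrative) checks it over all 24 rankings of four items, with weighted_kendall_tau redefined as in the previous sketch so the block is self-contained.

```python
from itertools import permutations

def weighted_kendall_tau(pi, sigma, pi0, w):  # as in the previous sketch
    k = len(pi)
    return sum(w[pi0[i] - 1] * w[pi0[j] - 1]
               for i in range(k) for j in range(i + 1, k)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def stage_counts(pi, pi0, w):
    # V'_r: weighted count of items ranked below rank r in pi0 that pi
    # nevertheless places above the item holding rank r in pi0.
    k = len(pi0)
    item_at = {pi0[i]: i for i in range(k)}  # rank in pi0 -> item index
    return [sum(w[s - 1] for s in range(r + 1, k + 1)
                if (pi[item_at[r]] - pi[item_at[s]]) * (r - s) < 0)
            for r in range(1, k)]

pi0 = (1, 2, 3, 4)
w = (2.0, 1.5, 1.0, 0.5)
for pi in permutations(range(1, 5)):
    v = stage_counts(pi, pi0, w)
    assert abs(weighted_kendall_tau(pi, pi0, pi0, w)
               - sum(w[r] * v[r] for r in range(3))) < 1e-12
print("T_w = sum_r w_r * V'_r holds for all 24 rankings")
```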
3.2. Other properties of weighted distance-based models

As defined in Critchlow et al. (1991), some properties for ranking models are: (1) label-invariance, (2) reversibility, (3) L-decomposability, (4) strong unimodality and (5) complete consensus. The definitions of these properties are given in Appendix B. Property (1) is naturally essential to all statistical models for ranking data, and weighted distance-based models with distances T_w, R_w, R²_w and F_w all satisfy it. However, some models do not satisfy properties (2)-(5). In particular, all our proposed weighted distance-based models satisfy property (2) (this can be easily verified), and only the models with weighted distances R²_w and F_w satisfy (3) (the proof is given in Appendix C). None of our proposed models satisfies properties (4) and (5) unless all the weights are the same, i.e. the distance is unweighted. However, rankings that violate properties (4) and (5) are commonly seen.

Consider the song dataset from Critchlow et al. (1991), in which 98 students were asked to rank 5 words, (1) score, (2) instrument, (3) solo, (4) benediction and (5) suit, according to their association with the word "song". Only part of the data is given in Critchlow et al. (1991), and we fit an F model, an F_w model and a (k − 1)-parameter model for comparison. The details are given in Table 4. The F_w model gives the best fit, as it has the lowest BIC value. In particular, the F_w model gives the best fit to the ranking 1 2 3 4 5, whereas the F and (k − 1)-parameter models fit it relatively poorly.

Table 4
Details of the F, F_w and (k − 1)-parameter models for the song dataset.

Ordering     Observed frequency   Expected (F)   Expected (F_w)   Expected (k − 1)
3 2 1 4 5    19                   28.478         26.592           21.157
3 1 2 4 5    10                    6.459         11.384           11.002
1 3 2 4 5     9                    1.465          5.204            5.722
3 2 4 1 5     8                    6.459          8.203           11.002
1 2 3 4 5     7                    1.465          5.557            1.116
2 3 1 4 5     6                    6.459          5.204            4.311
3 2 1 5 4     6                    6.459          2.001            2.255
3 2 4 5 1     5                    1.465          2.001            5.722
2 1 3 4 5     4                    1.465          2.379            2.242
3 1 4 2 5     3                    1.465          1.503            2.242
2 3 4 1 5     2                    1.465          1.605            2.242
3 4 2 1 5     2                    1.465          1.083            2.242
3 2 5 4 1     2                    1.465          1.583            0.610
Others        0                   16.963          8.701           11.086
Total        83
BIC                               472.773        446.460          454.835

One possible reason for this poor fit is that the data seem to violate property (4) (and hence property (5)). Note that all three models yield the same modal ranking 3 2 1 4 5, so item 1 is less preferred than item 2 in the modal ranking. It is interesting to examine the ranking pair (1 2 3 4 5, 2 1 3 4 5): under strong unimodality, P(1 2 3 4 5) should be less than P(2 1 3 4 5), yet the observed frequencies appear to violate this. In fact, such violations also occur in many other ranking pairs in the data. As the weighted distance-based model F_w does not satisfy property (4), it can give a better fit to these data than models that do satisfy it.

Through the extension to weighted distances, our proposed weighted distance-based models provide a wider class of ranking models, which may fit the data better when properties (4) and/or (5) do not hold. If a transposition of two particular items in π₀ yields a large drop in probability compared to P(π₀), while a transposition of two other items does not, the classical distance-based model will fit poorly but the weighted distance-based model can still fit well. This flexibility is illustrated in Table 4: the rankings (1 3 2 4 5) and (2 3 4 1 5) both have a footrule distance of 4 from the modal ranking 3 2 1 4 5, yet their observed frequencies are quite different. It is clear that the F_w model is more flexible than the F model.
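Property (4) can also be examined by direct enumeration for small k. The sketch below is our own illustration (the weights are arbitrary, not fitted to the song data); it confirms that an equally weighted footrule model satisfies strong unimodality, while an unequally weighted one can violate it.

```python
import math
from itertools import permutations

def weighted_footrule(pi, sigma, pi0, w):
    return sum(w[pi0[i] - 1] * abs(pi[i] - sigma[i]) for i in range(len(pi)))

def pmf(pi0, w):
    rankings = list(permutations(range(1, len(pi0) + 1)))
    e = {pi: math.exp(-weighted_footrule(pi, pi0, pi0, w)) for pi in rankings}
    c = sum(e.values())
    return {pi: v / c for pi, v in e.items()}

def unimodality_violation(pi0, w):
    # Return a ranking violating property (4), or None if none exists.
    p, k = pmf(pi0, w), len(pi0)
    for pi in p:
        for i in range(k):
            for j in range(k):
                # items i, j adjacent in pi, with the pi0-preferred item on top
                if pi0[i] < pi0[j] and pi[i] == pi[j] - 1:
                    swapped = list(pi)
                    swapped[i], swapped[j] = swapped[j], swapped[i]
                    if p[pi] < p[tuple(swapped)] - 1e-12:
                        return pi
    return None

print(unimodality_violation((1, 2, 3, 4), (1.0, 1.0, 1.0, 1.0)))  # None
print(unimodality_violation((1, 2, 3, 4), (2.0, 0.1, 1.5, 0.3)))  # a violating ranking
```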

4. Mixtures of weighted distance-based models

4.1. Mixture modeling framework

If a population contains G sub-populations with probability mass functions (pmfs) P_g(x), and the proportion of sub-population g equals p_g, the pmf of the mixture model is

P(x) = Σ_{g=1}^{G} p_g P_g(x).

Hence, the probability of observing a ranking π under a mixture of G weighted distance-based ranking models is

P(π) = Σ_{g=1}^{G} p_g P(π | w_g, π_{0g}) = Σ_{g=1}^{G} p_g exp(−d_{w_g}(π, π_{0g} | π_{0g})) / C(w_g),

and the loglikelihood for n observations is

l = Σ_{i=1}^{n} log [ Σ_{g=1}^{G} p_g exp(−d_{w_g}(π_i, π_{0g} | π_{0g})) / C(w_g) ],

where w_g and π_{0g} are the parameters of the weighted distance-based model of sub-population g. A noise component, with all weights equal to zero, is sometimes included to represent a sub-population with completely uncertain rank-order preferences. This can happen when the population is very large, and adding a noise component helps single out the sub-population with random preferences.

Estimating the model parameters by direct maximization of the loglikelihood function may lead to a high-dimensional numerical optimization problem. Instead, maximization can be achieved by applying the EM algorithm (Dempster et al., 1977). In short, the E-step computes, for every observation, the probability of belonging to each sub-population, and the M-step maximizes the conditional expected complete-data loglikelihood given the estimates generated in the E-step. To derive the EM algorithm, we define a latent variable z_i = (z_{1i}, ..., z_{Gi}) with z_{gi} = 1 if observation i belongs to sub-population g, and z_{gi} = 0 otherwise. The complete-data loglikelihood is

L_com = Σ_{i=1}^{n} Σ_{g=1}^{G} z_{gi} [log(p_g) − d_{w_g}(π_i, π_{0g} | π_{0g}) − log(C(w_g))].

First, we choose initial values for w_g, π_{0g} and p_g. Then we alternate between the E-step and the M-step until the parameters converge. In the E-step, ẑ_{gi}, g = 1, 2, ..., G, are updated for observations i = 1, 2, ..., n by

ẑ_{gi} = p̂_g P(π_i | ŵ_g, π̂_{0g}) / Σ_{h=1}^{G} p̂_h P(π_i | ŵ_h, π̂_{0h}).

In the M-step, the model parameters are updated by maximizing the complete-data loglikelihood with z_{gi} replaced by ẑ_{gi}. The MLEs π̂_{0g} and ŵ_g are obtained simultaneously (Fligner and Verducci, 1986). For a given g = 1, ..., G, π̂_{0g} is obtained by an exhaustive search algorithm, and then ŵ_g is obtained by solving the equation (Murphy and Martin, 2003, p. 648, Eq. (5))

Σ_{i=1}^{n} ẑ_{gi} d_{w_g}(π_i, π_{0g} | π_{0g}) / Σ_{i=1}^{n} ẑ_{gi} = Σ_{j=1}^{k!} P(π_j | w_g, π_{0g}) d_{w_g}(π_j, π_{0g} | π_{0g}).

Using the latest weights, π̂_{0g} is recomputed. The model fitting procedure stops when π̂_{0g} no longer changes.

Based on our limited experience, the parameter estimates are not sensitive to the initialization. Therefore, in this paper, random numbers drawn from uniform(0, 1), the value 1/G, and the ranking sorted according to the mean ranks were used as initial values for w, p and π₀ respectively. The EM algorithm converged within 20 iterations in most cases of our simulation studies and applications.

There are two major difficulties in fitting weighted distance-based models when k is large. First, the global search algorithm for the MLE π̂_{0g} is not practical because the number of possible choices is too large. Instead, as suggested in Busse et al. (2007), a local search algorithm should be used: compute the sum of distances Σ_{i=1}^{n} z_{gi} d_{w_g}(π_i, π̂_{0g} | π̂_{0g}) for all π̂_{0g} ∈ Π, where Π is the set of all rankings having a Cayley distance of 0 or 1 from the initial ranking. A reasonable initial ranking can be constructed using the mean ranks. Second, the numerical computation of the proportionality constant C(w_g) is time-consuming: C(w_g) is the summation of exp(−d_{w_g}(π, π_{0g} | π_{0g})) over all possible π, and its computational time increases exponentially with the number of items k. For small k, the proportionality constant can be computed efficiently by summing over all rankings. For large k, Lebanon and Lafferty (2002) proposed an MCMC algorithm for fitting (unweighted) distance-based models, and the simulation study in Klementiev et al. (2008) showed that the performance of this estimation technique is acceptable for k = 10. Similar methods can be extended to the weighted distance-based models.

To determine the number of components in the mixture, we use the Bayesian information criterion (BIC). The BIC equals −2l + v log(n), where l is the maximized loglikelihood, n is the sample size and v is the number of model parameters; the model with the smallest BIC is chosen as the best model. Murphy and Martin (2003) showed that the BIC works quite well if there is no noise component in the mixed population. Furthermore, Biernacki et al. (2006) showed that, for a large family of mixtures (with different numbers of mixing components and different underlying component densities), the BIC criterion is consistent, and the BIC has been shown to be efficient on practical grounds.
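A compact sketch of the fitting procedure is given below. It is our own illustration under simplifying assumptions: the weighted footrule is used as d_w, C(w) and the modal-ranking search are handled by brute-force enumeration (so k must be small), and the weights are updated by direct numerical minimization of the expected complete-data negative loglikelihood rather than by solving the moment equation above.

```python
import math
import numpy as np
from itertools import permutations
from scipy.optimize import minimize

def d_w(pi, pi0, w):
    # weighted footrule F_w; any weighted distance from Table 2 would do
    return sum(w[pi0[i] - 1] * abs(pi[i] - pi0[i]) for i in range(len(pi)))

def component_probs(pi0, w, rankings):
    # exact pmf of one weighted distance-based component (C(w) by enumeration)
    e = np.array([math.exp(-d_w(pi, pi0, w)) for pi in rankings])
    return e / e.sum()

def em_fit(data, G, k, iters=30, seed=0):
    rankings = list(permutations(range(1, k + 1)))
    idx = {pi: m for m, pi in enumerate(rankings)}
    rng = np.random.default_rng(seed)
    p = np.full(G, 1.0 / G)                      # initial p_g = 1/G
    w = rng.uniform(0, 1, size=(G, k))           # initial weights ~ uniform(0, 1)
    pi0 = [rankings[rng.integers(len(rankings))] for _ in range(G)]
    for _ in range(iters):
        # E-step: posterior membership probabilities z[i, g]
        comp = np.array([component_probs(pi0[g], w[g], rankings) for g in range(G)])
        lik = np.array([[p[g] * comp[g, idx[pi]] for g in range(G)] for pi in data])
        z = lik / lik.sum(axis=1, keepdims=True)
        # M-step: update p, then search pi0 exhaustively and update w per component
        p = z.mean(axis=0)
        for g in range(G):
            def neg_ell(wg, center):
                pr = component_probs(center, wg, rankings)
                return -sum(z[i, g] * math.log(pr[idx[pi]])
                            for i, pi in enumerate(data))
            pi0[g] = min(rankings, key=lambda c: neg_ell(w[g], c))
            w[g] = minimize(neg_ell, w[g], args=(pi0[g],),
                            bounds=[(0, None)] * k).x
    return p, pi0, w
```

Model selection would then compute BIC = −2l + v log n from the fitted loglikelihood for each candidate G and keep the model with the smallest value.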
4.2. Assessment of the goodness-of-fit

As opposed to criteria derived from the loglikelihood, it has been argued that loglikelihood values for essentially different models with different implicit flexibility are not directly comparable (Amemiya, 1985; Cox and Hinkley, 1974). Given these opposing views, we also rely on a goodness-of-fit measure to assess model performance. To assess the goodness-of-fit of the model, we use the sum of squared Pearson residuals (χ²) suggested by Lee and Yu (2010): χ² equals Σ_{i=1}^{k!} r_i², where r_i = (O_i − E_i)/√E_i is the Pearson residual, and O_i and E_i are the observed and expected frequencies of ranking i respectively. However, if some of the E_i are smaller than 5, the computed chi-square statistic will be biased. We are likely to encounter this problem when the dataset is small and k is large; in that case, we suggest using the truncated sum of squared Pearson residuals criterion described in Erosheva et al. (2007).

5. Simulation studies

In this section, three simulation results are reported. The first simulation studies the performance of the estimation algorithm for our weighted distance-based models, the second investigates the effectiveness of using the BIC to select the number of components, and the third compares the performance of the T_w model and the (k − 1)-parameter model, both derived from the T model.

In the first simulation, datasets of rankings of 4 items, each with sample size 2000, were simulated to study the accuracy of model fitting. We considered 4 models based on the weighted Kendall's tau distance; their parameters are listed in Tables 5 and 6. The modal rankings were the same in Models 1 and 2, but the dispersion in Model 1 was comparatively larger. The modal rankings were the same in Models 3 and 4 as well, and again the dispersion in Model 3 (second component) was comparatively larger. The initial values for ŵ were randomly drawn from uniform(0, 1).

Table 5
Simulation settings of Models 1 and 2.

Model   π₀        w₁   w₂     w₃    w₄
1       1 2 3 4   2    1.5    1     0.5
2       1 2 3 4   1    0.75   0.5   0.25

Table 6
Simulation settings of Models 3 and 4.

Model   p     π₀        w₁   w₂     w₃    w₄
3       0.5   1 2 3 4   2    1.5    1     0.5
        0.5   4 3 2 1   2    1.5    1     0.5
4       0.5   1 2 3 4   2    1.5    1     0.5
        0.5   4 3 2 1   1    0.75   0.5   0.25

The simulation results, based on 50 replications, are summarized in Tables 7 and 8, with empirical standard deviations of the parameter estimates given in parentheses. Two observations emerge from these results. First, the model estimates are very close to their true values, indicating that our proposed algorithm works well for fitting mixtures of weighted distance-based models for ranking data. Second, the estimates are more accurate for models with larger weights, as these estimates have smaller standard deviations. This is because, for models with large weights, observations tend to be concentrated around the modal ranking, so the probability distribution is far from the uniform distribution, for which the standard error of the central tendency is expected to be largest.

Table 7
First simulation results.

       Model 1 (π₀ = 1 2 3 4)   Model 2 (π₀ = 1 2 3 4)
w₁     2.002 (0.059)            0.981 (0.081)
w₂     1.509 (0.055)            0.779 (0.089)
w₃     0.995 (0.032)            0.492 (0.035)
w₄     0.497 (0.013)            0.250 (0.030)

Table 8
First simulation results (cont'd).

       Model 3                               Model 4
       π₀ = 1 2 3 4    π₀ = 4 3 2 1          π₀ = 1 2 3 4    π₀ = 4 3 2 1
p      0.500 (0.007)   0.500                 0.499 (0.028)   0.501
w₁     1.976 (0.129)   1.961 (0.123)         2.088 (0.232)   1.039 (0.158)
w₂     1.535 (0.121)   1.540 (0.107)         1.458 (0.173)   0.747 (0.174)
w₃     0.995 (0.063)   0.995 (0.065)         1.036 (0.182)   0.497 (0.072)
w₄     0.500 (0.035)   0.498 (0.025)         0.501 (0.050)   0.252 (0.072)

In the second simulation, we used the four models described in the first simulation. Mixtures of weighted distance-based models with G = 1, 2, 3 mixing components were fitted. We repeated this process 50 times and recorded how often each mixture model was selected as the best according to the BIC. The results are shown in Table 9. The "+N" notation indicates an additional noise component (a uniform model, i.e. w = 0), so the model q + N is a mixture of q + 1 sub-populations, one of which is a noise component. The simulation results show that the BIC can correctly identify the number of components, and its performance improves for models with larger weights, since these have more observations concentrated at the modal ranking. However, the BIC sometimes suggests including an additional noise component, probably because the noise component has only one parameter, so the improvement in loglikelihood is penalized relatively lightly.

Table 9
Second simulation results: frequencies of the mixture models selected using the BIC.

Model   N   1    1 + N   2    2 + N   3
1       0   45    5       0    0      0
2       0   37   13       0    0      0
3       0    0    0      49    1      0
4       0    0    0      47    3      0

In the third simulation, we compare the performance of the T_w model and the (k − 1)-parameter model in terms of flexibility. Two settings were simulated, one being a (k − 1)-parameter model with parameters (1, 0.75, 0.5) and the other a T_w model with parameters (2, 1.5, 0.2, 0.2). The simulation was replicated 50 times, and each time a (k − 1)-parameter model and a weighted tau model were fitted to the data. Figs. 1 and 2 show that, when the underlying model is a (k − 1)-parameter model, the performance of the T_w model (in terms of both the BIC and the sum of squared Pearson residuals) is very close to that of the (k − 1)-parameter model. On the other hand, when the underlying model is a T_w model, the T_w model outperforms the (k − 1)-parameter model. This simulation shows that the T_w model is more flexible than the (k − 1)-parameter model.
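As a usage example of the EM sketch above, the first simulation can be reproduced in miniature: draw 2000 rankings from the exact pmf of a single weighted distance-based component and refit it. The settings mirror Model 1 of Table 5, except that the weighted footrule stands in for the weighted Kendall's tau used in the paper.

```python
import numpy as np
from itertools import permutations
# d_w, component_probs and em_fit as defined in the EM sketch above

k = 4
pi0_true = (1, 2, 3, 4)
w_true = np.array([2.0, 1.5, 1.0, 0.5])        # Model 1 settings of Table 5
rankings = list(permutations(range(1, k + 1)))

rng = np.random.default_rng(1)
probs = component_probs(pi0_true, w_true, rankings)
data = [rankings[m] for m in rng.choice(len(rankings), size=2000, p=probs)]

p_hat, pi0_hat, w_hat = em_fit(data, G=1, k=k)
print(pi0_hat[0], np.round(w_hat[0], 3))       # should lie close to the truth
```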

Fig. 1. Box plots of the BIC over 50 replications. The two ranking datasets on the left are generated from a (k − 1)-parameter model, and the two on the right are generated from a T_w model.

Fig. 2. Box plots of the sum of squared Pearson residuals over 50 replications. The two ranking datasets on the left are generated from a (k − 1)-parameter model, and the two on the right are generated from a T_w model.

6. Applications

We apply our mixture models to six ranking datasets. The first five are synthetic datasets from Cheng et al. (2009), who turned some multi-class and regression datasets from the UC Irvine repository and the Statlog collection into ranking datasets using a naive Bayes classifier approach. The last dataset is a real-world ranking dataset from the political study described in Croon (1989).

6.1. Application to synthetic data

We consider the synthetic ranking datasets with 4-5 items studied in Cheng et al. (2009); information on the five datasets is given in Table 10. We compare the performance of our proposed mixtures of weighted distance-based models (T_w, R_w, R²_w and F_w) with the existing mixtures of distance-based models (Murphy and Martin, 2003). Besides mixture models with heterogeneous π_{0g} and w_g, we also consider models with heterogeneity only in π_{0g} but a constant w across all mixture components; we denote such models with G components by G* throughout the paper. The results are shown in Table 11. It is clear that, for all types of distance measures, the BIC values for the weighted distance-based models are always smaller than those of their unweighted counterparts. Furthermore, although not reported here, the weighted distance-based models are also always better than their unweighted counterparts in terms of the BIC for any number of components between 1 and 5.

Therefore, we can conclude that our weighted distance-based models provide a better fit than the (unweighted) distance-based models.

Table 10
Information on the five synthetic ranking datasets.

Dataset      Size    Number of items
Authorship     841   4
Calhousing   20640   4
Cpu-small     8192   5
Stock          950   5
Vehicle        846   4

Table 11
BIC and number of components of the unweighted/weighted models.

Dataset      Distance/(k − 1)   BIC         # Mixture   Weighted distance   BIC         # Mixture
Authorship   T                    3420.89   3           T_w                   3291.88   3
             R                    3395.97   3           R_w                   3267.56   3
             R²                   3474.67   3           R²_w                  3307.00   3
             F                    3383.43   3           F_w                   3265.76   3
             k − 1                3341.12   3
Calhousing   T                  117189.77   4           T_w                 108289.27   3 + N
             R                  116956.57   4           R_w                 108206.84   4
             R²                 117472.59   4           R²_w                107705.41   4
             F                  116099.16   4           F_w                 108190.60   4
             k − 1              114087.43   4
Cpu-small    T                   74349.70   5           T_w                  71950.85   5
             R                   74158.58   5           R_w                  70429.12   5
             R²                  74543.90   5           R²_w                 70335.52   5
             F                   73734.33   5           F_w                  69267.12   5
             k − 1               73928.17   5
Stock        T                    8426.10   4           T_w                   7939.28   5
             R                    8031.21   5           R_w                   7865.58   5
             R²                   8146.25   5           R²_w                  7661.42   5
             F                    8126.07   5           F_w                   7768.24   5
             k − 1                7975.88   5
Vehicle      T                    3768.32   4           T_w                   3641.59   4
             R                    3742.28   4           R_w                   3592.63   4
             R²                   3832.48   4           R²_w                  3565.41   4
             F                    3713.24   4           F_w                   3558.61   4
             k − 1                3723.14   4

6.2. Application to real data: social science research on political goals

To illustrate the applicability of the weighted distance-based models described in Section 3, we study the ranking dataset obtained from Croon (1989). It consists of 2262 rankings of four political goals for the government, collected in a survey conducted in Germany. The four goals were: (A) maintain order in the nation, (B) give people more say in government decisions, (C) fight rising prices, and (D) protect freedom of speech. The respondents were classified into three value-priority groups according to their top two choices (Inglehart, 1977). "Materialist" corresponds to individuals who gave priority to (A) and (C), regardless of ordering, whereas those who chose (B) and (D) were classified as "post-materialist". The last category comprised respondents giving all other combinations of rankings; they were classified as holding mixed value orientations.

Weighted distance-based models were fitted for the four types of weighted distances (T_w, R_w, R²_w and F_w), with numbers of mixing components G = N, 1, ..., 3 + N and 4. The BIC values are listed in Table 12, from which the best number of components within each distance type and the overall best mixture model can be read. The best model is the weighted footrule with G = 3. Its BIC of 12670.82 is better than that of the strict utility (SU) model (12670.87; parameter estimates are provided in Table 14, and interested readers are referred to Croon (1989) for the interpretation of the SU model) and the Pendergrass-Bradley (PB) model (12673.07) discussed in Croon (1989).
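The model comparison below also uses the χ² goodness-of-fit measure of Section 4.2, reported at the bottom of Tables 15 and 16. A minimal sketch of that computation (ours; the inputs are observed and expected frequency vectors aligned over all k! rankings) is:

```python
import numpy as np

def pearson_chi_square(observed, expected):
    # sum of squared Pearson residuals r_i = (O_i - E_i) / sqrt(E_i)
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    return float(((o - e) ** 2 / e).sum())
```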

Table 12
BIC of the mixture models. G* denotes a mixture of G components with a common weight vector w (see Section 6.1).

# Mixture   T          R          R²         F          k − 1      T_w        R_w        R²_w       F_w
N           14377.52   14377.52   14377.52   14377.52   14377.52   14377.52   14377.52   14377.52   14377.52
1           13052.58   13001.36   12988.06   13163.26   13000.95   12974.28   13011.22   12951.34   13174.30
1 + N       13014.91   13009.09   12889.45   13162.11   13007.62   12943.44   13018.94   12863.46   13172.96
2           12908.05   12848.57   12851.75   12980.63   12877.35   12797.52   12774.10   12864.90   12806.18
2*          12936.09   12849.51   12925.87   12981.19   12908.73   12887.88   12827.44   12929.26   12900.34
2 + N       12860.70   12856.28   12758.02   12944.18   12736.21   12688.72   12713.96   12691.64   12697.92
2* + N      12863.18   12858.08   12758.74   12944.04   12754.21   12697.32   12764.72   12701.58   12720.92
3           12846.88   12832.44   12754.64   12902.74   12741.58   12692.20   12678.88   12671.24   12670.82
3*          12847.04   12836.78   12756.52   12905.32   12747.52   12745.32   12783.18   12714.74   12746.28
3 + N       12839.56   12733.08   12770.09   12932.52   12731.39   12678.06   12688.20   12673.36   12843.80
3* + N      12846.78   12844.50   12737.72   12913.04   12748.15   12699.50   12776.08   12717.68   12724.94
4           12851.53   12847.89   12770.09   12918.19   12754.56   12730.74   12716.28   12709.86   12701.08
4*          12862.49   12852.23   12771.97   12920.77   12778.42   12783.94   12811.66   12753.36   12784.08

It is undoubtedly better than the best (unweighted) distance-based model (12733.08, footrule). For all types of distances, both unweighted and weighted, the lowest BIC appears when G = 3 or 3 + N.

The parameter estimates of the best model, a mixture of three weighted footrule models, are shown in Table 13.

Table 13
Parameters of the weighted footrule mixture model.

Group   Ordering of goals in π₀   p       w₁      w₂      w₃      w₄
1       C A B D                   0.352   2.030   1.234   0       0.191
2       A C B D                   0.441   1.348   0.917   0.107   0.104
3       B D C A                   0.208   0.314   0       0.151   0.552

The first two groups, which comprise 79% of the respondents, are materialists, as they rank (A) and (C) as more important than the other two goals. The third group is post-materialist, as people in this group rank (B) and (D) as more important. Based on our grouping, Inglehart's theory is not entirely appropriate for Germany: we should at least distinguish two types of materialists, one ranking (A) higher than (C) and the other (C) higher than (A). This conclusion is similar to the findings in Croon (1989) and Moors and Vermunt (2007). However, the mixture solution we obtained here is slightly different from the SU mixture solution of Croon, whose parameter estimates are given in Table 14.

Table 14
Parameters of the SU mixture model. An item with a larger parameter is more preferred. (The signs shown below are reconstructed so that each group's parameter ordering matches its stated central ranking; the extraction of the original table had dropped the minus signs.)

Group   p       A        B        C        D
1       0.449    0.590   −1.071    1.730   −1.249
2       0.326    1.990   −0.920    0.060   −1.130
3       0.225   −0.691    0.630    0.010    0.071

This difference can be evidenced by visualizing the data via a truncated octahedron (Thompson, 1993). This visualization technique enables a better understanding of the ranking distribution: all ranking frequencies can be presented in such a way that similar rankings are closer together in the truncated octahedron. An illustration is shown in Fig. 3. The 24 rankings are placed on the vertices such that each edge represents an adjacent transposition. Among the six hexagonal surfaces, there are four on which all vertices share the same top choice; for example, the hexagonal surface facing the reader represents the six rankings with top choice C. The dot size is proportional to the ranking frequencies, and rankings which constitute more than 5% of the corresponding group are dotted.

Fig. 3. A truncated octahedron representing rankings of 4 items.

Figs. 4 and 5 show the predicted distributions of the F_w and SU mixture models respectively. It can be seen that the three components produced using the F_w distance are more separated, as the difference between groups 1 and 2 is much clearer in the F_w mixture model than in the SU mixture model. Detailed frequency tables are provided in Tables 15 and 16.

Fig. 4. The truncated octahedron representation of the F_w mixture model. The three truncated octahedrons represent components having central rankings C A B D, A C B D and B D C A, respectively. Rankings with frequency greater than 5% of the total component size are plotted. Compared with Fig. 3, the F_w components are clearly much purer.

Fig. 5. The truncated octahedron representation of the SU mixture model.

For groups 1 and 2, the weights w₃ and w₄ are close to zero while w₁ and w₂ are much larger, indicating that observations from groups 1 and 2 are mainly of the forms C A ? ? and A C ? ? respectively. Compared with groups 1 and 2, the weights in group 3 are relatively closer to zero, implying that people in this group were less certain about their preferences than people in the other groups. The weight associated with item A is the largest in group 3 (w₄ = 0.552, item A being ranked fourth in the modal ranking B D C A), which means that A has a relatively high probability of being ranked last; this can be seen in Table 16 as well.

Although both models suggest a three-component solution and their χ² values are very close, the constituents of the three components are quite different. In the SU mixture the estimated proportions of groups 1 and 2 are 0.449 and 0.326 respectively; compared with the SU mixture solution, our solution has a higher estimated proportion of group 2 (0.441). This difference is mainly due to the difference in assigning the rankings A C B D and A C D B to group 2. At first glance, these two rankings should be assigned to group 2. Referring to Tables 15 and 16, our mixture model assigns approximately 96% of these two rankings to group 2, while for the SU mixture model the percentage drops to 63%. The grouping of our weighted footrule mixture model appears more reasonable than that of the SU mixture model.

Table 15
Observed and expected frequencies under the SU mixture model. Components 1, 2 and 3 have central rankings C A B D, A C B D and B D C A, respectively.

Ordering   Observed    Expected (SU)
           frequency   1         2         3        Total
ABCD       137          11.802   101.686   13.134   126.622
ABDC        29           0.600    30.954   14.246    45.800
ACBD       309         111.084   195.125    9.180   315.388
ACDB       255          92.907   158.243    5.246   256.396
ADBC        52           0.595    29.357   10.247    40.199
ADCB        93           9.778    78.211    5.399    93.389
BACD        48           9.571    20.837   20.187    50.595
BADC        23           0.487     6.343   21.896    28.726
BCAD        61          27.121     3.777   26.448    57.346
BCDA        55           4.310     0.167   56.638    61.115
BDAC        33           0.387     1.048   30.299    31.735
BDCA        59           1.211     0.152   59.820    61.183
CABD       330         286.092    43.282   10.478   339.853
CADB       294         239.280    35.102    5.988   280.369
CBAD       117          86.135     4.088   19.641   109.864
CBDA        69          13.688     0.181   42.061    55.929
CDAB        70          70.159     3.283    7.427    80.869
CDBA        34          13.330     0.179   27.833    41.342
DABC        21           0.479     5.964   11.988    18.431
DACB        30           7.872    15.890    6.316    30.078
DBAC        29           0.385     1.039   23.060    24.484
DBCA        52           1.202     0.151   45.529    46.882
DCAB        35          21.931     3.007    7.612    32.550
DCBA        27           4.167     0.164   28.525    32.856
χ²                                                   23.073

Table 16
Observed and expected frequencies under the F_w mixture model. Although both models suggest a three-component solution with very close χ² values, the constituents of the components are quite different.

Ordering   Observed    Expected (F_w)
           frequency   1         2         3        Total
ABCD       137           1.519   107.782    8.492   117.792
ABDC        29           0.165    38.832    7.300    46.296
ACBD       309          11.566   300.104    5.334   317.005
ACDB       255           9.556   242.962    3.898   256.417
ADBC        52           0.136    38.954    5.334    44.425
ADCB        93           1.037    87.535    4.535    93.106
BACD        48           5.218    25.143   20.176    50.538
BADC        23           0.566     9.059   17.344    26.969
BCAD        61          11.566    16.332   30.115    58.013
BCDA        55           2.782     3.822   52.288    58.892
BDAC        33           0.136     2.120   30.115    32.371
BDCA        59           0.302     1.377   60.825    62.504
CABD       330         302.565    31.155    7.962   341.682
CADB       294         249.985    25.223    5.819   281.026
CBAD       117          88.071     7.268   18.918   114.257
CBDA        69          21.181     1.701   32.848    55.729
CDAB        70          60.121     5.902   10.103    76.126
CDBA        34          17.500     1.706   24.004    43.210
DABC        21           0.387     9.116    9.262    18.765
DACB        30           2.943    20.484    7.874    31.301
DBAC        29           0.113     2.127   22.007    24.246
DBCA        52           0.249     1.381   44.449    46.080
DCAB        35           6.524    13.305   11.752    31.581
DCBA        27           1.899     3.846   27.923    33.668
χ²                                                   22.811

7. Conclusion

7.1. Concluding remarks

We proposed a new class of distance-based models using weighted distance measures for ranking data. The models assume that the population is formed by the aggregation of several homogeneous sub-populations, and that the rankings observed in each sub-population follow a weighted distance-based model. The weighted distance-based ranking models proposed in this paper keep the nature of a distance and at the same time attain greater flexibility. Properties of the weighted distances were studied, in particular the relationship between the weighted tau and the distance used in the (k − 1)-parameter models. Although our proposed weighted distance-based models do not satisfy some of the statistical properties of ranking models, we found that they have clear advantages. Simulation results showed that the algorithm can accurately estimate the model parameters and that the BIC is appropriate for model selection. Applications to both synthetic and real data showed that our extended weighted models fit the data better than their corresponding equally weighted counterparts, while the interpretation of the model remains simple and straightforward.

The mixtures of weighted distance-based models are not without limitations. For ranking data with many items, modeling weights on the less-preferred items may be unnecessary, as the judges may be indifferent to them. Also, the computational burden increases exponentially as the number of items grows.

7.2. Further research direction

An area of future research could be the development of the proposed mixtures of weighted distance-based models for partially ranked data. A partial ranking π occurs when only q of the k items (q < k − 1) are ranked. Partial rankings are commonly seen in survey studies, where the number of items is very large and/or only the top few preferred items are of particular interest. Extending weighted distance-based models to partially ranked data would greatly widen their applicability. The (k − 1)-parameter model was recently extended to partially ranked data by Meilă and Bao (2010), and the extension may be applied to weighted distance-based models. It is also of interest to develop computationally efficient algorithms for computing the proportionality constant in weighted distance-based models for rankings of a large number of items.

Acknowledgments

The research of Philip L.H. Yu was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKU 7473/05H). We thank the associate editor and three anonymous referees for their helpful suggestions for improving this article.

Appendix A. Proof of the triangle inequality for T_w, R_w and F_w

Theorem 1. T_w, R_w and F_w satisfy the triangle inequality.

Proof. The weighted tau T_w counts the weighted disagreements over all possible item pairs in the two rankings. Without loss of generality, assume π_a(i) < π_a(j). For this particular item pair (i, j) there are four possibilities: π_b(i) < π_b(j) and π_c(i) < π_c(j); π_b(i) < π_b(j) and π_c(i) > π_c(j); π_b(i) > π_b(j) and π_c(i) < π_c(j); and π_b(i) > π_b(j) and π_c(i) > π_c(j). It is easily confirmed that, in all four cases, the pair's contribution to the left-hand side of the inequality is less than or equal to its contribution to the right-hand side.

The weighted Spearman's rho R_w can be expressed as

R_w(π_a, π_b) = [Σ_{i=1}^{k} w_{π₀(i)} (π_a(i) − π_b(i))²]^{0.5} = [Σ_{i=1}^{k} (u_a(i) − u_b(i))²]^{0.5},

where u_a(i) = w_{π₀(i)}^{0.5} π_a(i) and u_b(i) = w_{π₀(i)}^{0.5} π_b(i). As R_w has the form of a Euclidean distance, it satisfies the triangle inequality.

The weighted footrule F_w can be decomposed as

F_w = Σ_{i=1}^{k} w_{π₀(i)} F_i,

where each F_i = |π_a(i) − π_b(i)| satisfies the triangle inequality. We can therefore conclude that F_w satisfies the triangle inequality. □

Appendix B. Definitions of properties of ranking models

(1) Label-invariance. Relabeling of the items has no effect on the probability model.

(2) Reversibility. A reverse function γ for a ranking of k items is defined as γ(i) = k + 1 − i. Reversing the ranking π has no effect on the probability model.

(3) L-decomposability. The ranking of k items can be decomposed into k − 1 stages. At stage i, where i = 1, 2, ..., k − 1, the best of the items remaining at that stage is selected, and this item is then removed from the following stages.

(4) Strong unimodality (weak transposition property). A transposition function τ_ij is defined as τ(i) = j, τ(j) = i and τ(m) = m for all m ≠ i, j. With modal ranking π₀, for every pair of items i and j such that π₀(i) < π₀(j), and every π such that π(i) = π(j) − 1, P(π) ≥ P(π∘τ_ij), with equality attained at π = π₀. This guarantees that the probability is non-increasing as π moves one step away from π₀ for items having adjacent ranks.

(5) Complete consensus (transposition property). Compared with strong unimodality, complete consensus is an even stronger property: it guarantees that, for every pair of items (i, j) such that π₀(i) < π₀(j), and every π such that π(i) < π(j), P(π) ≥ P(π∘τ_ij). From this definition, we can see that complete consensus implies strong unimodality.

Appendix C. Properties of weighted distance-based models

Theorem 2. R²_w and F_w are L-decomposable.

Proof. Critchlow et al. (1991) showed that a ranking model is L-decomposable if there exist functions f_r, r = 1, ..., k, such that

d(π, e) = Σ_{r=1}^{k} f_r[π⁻¹(r)].

Therefore, R²_w and F_w are L-decomposable, as

R²_w(π, e) = Σ_{r=1}^{k} w_{π₀(π⁻¹(r))} [r − π⁻¹(r)]²

and

F_w(π, e) = Σ_{r=1}^{k} w_{π₀(π⁻¹(r))} |r − π⁻¹(r)|. □

References

Amemiya, T., 1985. Advanced Econometrics. Harvard University Press.
Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster and discriminant analysis with the MIXMOD software. Computational Statistics and Data Analysis 51, 587-600.
Busse, L.M., Orbanz, P., Buhmann, J.M., 2007. Cluster analysis of heterogeneous rank data. In: Proceedings of the 24th International Conference on Machine Learning, pp. 113-120.
Cheng, W., Hühn, J., Hüllermeier, E., 2009. Decision tree and instance-based learning for label ranking. In: Proceedings of the 26th International Conference on Machine Learning.
Cox, D.R., Hinkley, D.V., 1974. Theoretical Statistics. Chapman and Hall, London.
Critchlow, D.E., 1985. Metric Methods for Analyzing Partially Ranked Data. Lecture Notes in Statistics, vol. 34. Springer, Berlin.
Critchlow, D.E., Fligner, M.A., Verducci, J.S., 1991. Probability models on rankings. Journal of Mathematical Psychology 35, 294-318.
Croon, M.A., 1989. Latent class models for the analysis of rankings. In: De Soete, G., Feger, H., Klauer, K.C. (Eds.), New Developments in Psychological Choice Modeling. Elsevier Science, North-Holland, pp. 99-121.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39, 1-38.
Diaconis, P., 1988. Group Representations in Probability and Statistics. Institute of Mathematical Statistics, Hayward.
Erosheva, E.A., Fienberg, S.E., Joutard, C., 2007. Describing disability through individual-level mixture models for multivariate binary data. The Annals of Applied Statistics 1, 502-537.
Fligner, M.A., Verducci, J.S., 1986. Distance based ranking models. Journal of the Royal Statistical Society: Series B 48, 359-369.
Fligner, M.A., Verducci, J.S., 1988. Multi-stage ranking models. Journal of the American Statistical Association 83, 892-901.
Gormley, I.C., Murphy, T.B., 2006. Analysis of Irish third-level college application data. Journal of the Royal Statistical Society: Series A 169, 361-379.
Gormley, I.C., Murphy, T.B., 2008. Exploring voting blocs within the Irish electorate: a mixture modeling approach. Journal of the American Statistical Association 103, 1014-1027.
Hennig, C., Hausdorf, B., 2006. Design of dissimilarity measures: a new dissimilarity measure between species distribution areas. In: Batagelj, V., Bock, H.H., Ferligoj, A., Ziberna, A. (Eds.), Data Science and Classification, pp. 29-38.
Inglehart, R., 1977. The Silent Revolution: Changing Values and Political Styles Among Western Publics. Princeton University Press, Princeton.
Klementiev, A., Roth, D., Small, K., 2008. Unsupervised rank aggregation with distance-based models. In: Proceedings of the 25th International Conference on Machine Learning.