Computational Statistics and Data Analysis 56 (2012)

Mixtures of weighted distance-based models for ranking data with applications in political studies

Paul H. Lee, Philip L.H. Yu

Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong
E-mail addresses: honglee@graduate.hku.hk (P.H. Lee), plhyu@hku.hk (P.L.H. Yu)

Article history: Received 17 August 2010; Received in revised form 30 January 2012; Accepted 2 February 2012; Available online 15 February 2012.

Keywords: Ranking data; Distance-based models; Mixture models

Abstract

Analysis of ranking data is often required in various fields of study, for example politics, market research and psychology. Over the years, many statistical models for ranking data have been developed. Among them, distance-based ranking models postulate that the probability of observing a ranking of items depends on the distance between the observed ranking and a modal ranking: the closer to the modal ranking, the higher the ranking probability. However, such a model assumes a homogeneous population, and the single dispersion parameter in the model may not be able to describe the data well. To overcome these limitations, we formulate more flexible models by considering the recently developed weighted distance-based models, which allow different weights for different ranks. The assumption of a homogeneous population can be relaxed by an extension to mixtures of weighted distance-based models. The properties of weighted distance-based models are also discussed. We carry out simulations to test the performance of our parameter estimation and model selection procedures. Finally, we apply the proposed methodology to analyze synthetic ranking datasets and a real-world ranking dataset on the priority of political goals. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Ranking data frequently occur when judges (or individuals) are asked to rank a set of items, which may be political goals, candidates in an election, types of soft drinks, etc. By studying ranking data, we can understand judges' perceptions of and preferences for the ranked alternatives. Analysis of ranking data is thus often required in various fields of study, such as politics, market research and psychology. Over the years, various statistical models for ranking data have been developed, including order statistics models, rankings induced by paired comparisons, distance-based models and multistage models. See Critchlow et al. (1991) and Marden (1995) for more details of these models.

Among the many models for ranking data, distance-based models have the advantage of being simple and elegant. Distance-based models (Fligner and Verducci, 1986) assume a modal ranking π0, and the probability of observing a ranking π is inversely related to its distance from the modal ranking: the closer π is to the modal ranking π0, the more frequently it is observed. There are different measures of the distance between two rankings; some examples are the Kendall, Spearman and Cayley distances (see Mallows, 1957; Critchlow, 1985; Diaconis, 1988; Spearman, 1904). Distance-based models have received much less attention than they deserve, probably because the models are not flexible enough. With the aim of increasing model flexibility, Fligner and Verducci (1986) generalized the one-parameter distance-based models to (k−1)-parameter models, based on the decomposition of a distance measure.
However, the symmetry property of the generalized distance measure is lost in their models, and the (k−1)-parameter models do not belong to the class of distance-based models.

In view of extending the class of distance-based models, Lee and Yu (2010) proposed using weighted distance measures in the models, which allow different weights for different ranked items. In this way, the properties of a distance can be retained and, at the same time, the model flexibility is enhanced. However, Lee and Yu (2010) did not study the properties of these new ranking models, which will be considered in this paper.

Distance-based models assume a homogeneous population, and this is an important limitation. In the case of heterogeneous data, one can adopt a mixture modeling framework to produce more sophisticated models. The EM algorithm (Dempster et al., 1977) can fit the mixture models in a simple and fast manner. Mixture models for ranking data are not new and have been studied extensively, for example by Gormley and Murphy (2006, 2008), Croon (1989), Stern (1993) and Moors and Vermunt (2007). However, these studies were not on distance-based models. Recently, Murphy and Martin (2003) and Melia and Chan (2010) extended the use of mixture models to distance-based models and (k−1)-parameter models respectively, to describe the presence of heterogeneity among judges. In this way, the limitation of the assumption of a homogeneous population in distance-based models can be relaxed. Inspired by the results of the aforementioned research, we develop mixtures of weighted distance-based models for ranking data in this paper.

The remainder of this paper is organized as follows. In Section 2, we review the distance-based models for ranking data, and in Section 3 we explore the properties of the weighted distance-based models. The newly proposed mixtures of weighted distance-based models are explained in Section 4. Simulation studies are carried out to assess the performance of the model fitting and selection algorithm, and the results are presented in Section 5. To illustrate the feasibility of the proposed model, studies of synthetic ranking datasets and a social science ranking dataset on political goals are presented in Section 6. Finally, concluding remarks are given in Section 7.

2. Distance-based models for ranking data

2.1. Distance-based models

For a better description of ranking data, some notation must be defined first. In ranking k items, labeled 1, ..., k, a ranking π is a mapping function from {1, ..., k} to {1, ..., k}, where π(i) is the rank given to item i. For example, π(2) = 3 means that item 2 is ranked third. A distance function is useful in measuring the discrepancy between two rankings. The usual properties of a distance function are: (1) d(π, π) = 0; (2) d(π, σ) > 0 if π ≠ σ; and (3) d(π, σ) = d(σ, π). For ranking data, we require that the distance, apart from having these usual properties, must be right invariant, i.e. d(π, σ) = d(π ∘ τ, σ ∘ τ), where π ∘ τ(i) = π(τ(i)). This requirement ensures that relabeling of the items has no effect on the distance. Some popular distances are given in Table 1, where I{·} is an indicator function. Apart from these distances, there are other distances for ranking data, and readers can refer to Critchlow et al. (1991) for details.

Table 1
Some distances for ranking data.

Name                     Short form    Formula
Spearman's rho           R(π, σ)       [Σ_{i=1}^{k} (π(i) − σ(i))²]^{1/2}
Spearman's rho square    R²(π, σ)      Σ_{i=1}^{k} (π(i) − σ(i))²
Spearman's footrule      F(π, σ)       Σ_{i=1}^{k} |π(i) − σ(i)|
Kendall's tau            T(π, σ)       Σ_{i<j} I{[π(i) − π(j)][σ(i) − σ(j)] < 0}
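For concreteness, the distances in Table 1 can be computed directly from their definitions. The short Python sketch below is our illustration rather than part of the original paper; it stores a ranking as an array whose i-th entry is π(i), the rank given to item i, and the function names and example rankings are our own.

```python
import numpy as np

def spearman_rho(pi, sigma):
    """Spearman's rho R: square root of the sum of squared rank differences."""
    return np.sqrt(np.sum((pi - sigma) ** 2))

def spearman_rho_square(pi, sigma):
    """Spearman's rho square R^2: sum of squared rank differences."""
    return np.sum((pi - sigma) ** 2)

def spearman_footrule(pi, sigma):
    """Spearman's footrule F: sum of absolute rank differences."""
    return np.sum(np.abs(pi - sigma))

def kendall_tau(pi, sigma):
    """Kendall's tau T: number of discordant item pairs."""
    k = len(pi)
    return sum((pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
               for i in range(k) for j in range(i + 1, k))

# pi[i] holds the rank of item i + 1 (0-based storage of a 1-based ranking).
pi = np.array([1, 2, 3, 4])
sigma = np.array([2, 1, 4, 3])
print(kendall_tau(pi, sigma), spearman_footrule(pi, sigma))  # 2 4
```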
Diaconis (1988) developed a class of distance-based models,

P(π | λ, π0) = exp{−λ d(π, π0)} / C(λ),

where λ ≥ 0 is the dispersion parameter, d(π, π0) is an arbitrary right-invariant distance, and C(λ) is the proportionality constant. In the particular case where d is Kendall's tau, the model is named the Mallows φ-model (Mallows, 1957). The parameter λ measures how individuals' preferences differ from the modal ranking π0. The probability of observing a ranking π drops as π moves away from π0. When λ approaches zero, the distribution of rankings becomes uniform.

2.2. Weighted distance-based models

In this subsection we describe in detail the weighted distance-based models proposed in Lee and Yu (2010). Motivated by the weighted Kendall's tau correlation coefficient proposed by Shieh (1998), Lee and Yu (2010) defined some weighted distances; they are given in Table 2, where w_{π0(i)} is the weight assigned to item i. Note that the weights are assigned according to the modal ranking π0, i.e. w1 is the weight assigned to the item ranked first in π0. Introducing weights allows different penalties for different mistakenly ranked items, and hence the flexibility of the distance-based model is increased. Apart from the weighted Kendall's tau (Shieh, 1998) and the weighted Spearman's rho square (Shieh et al., 2000), many other weighted rank correlations have been proposed; see, for example, Tarsitano (2009).

Table 2
Some weighted distances for ranking data.

Name                             Short form         Formula
Weighted Kendall's tau           T_w(π, σ | π0)     Σ_{i<j} w_{π0(i)} w_{π0(j)} I{[π(i) − π(j)][σ(i) − σ(j)] < 0}
Weighted Spearman's rho          R_w(π, σ | π0)     [Σ_{i=1}^{k} w_{π0(i)} (π(i) − σ(i))²]^{1/2}
Weighted Spearman's rho square   R²_w(π, σ | π0)    Σ_{i=1}^{k} w_{π0(i)} (π(i) − σ(i))²
Weighted Spearman's footrule     F_w(π, σ | π0)     Σ_{i=1}^{k} w_{π0(i)} |π(i) − σ(i)|
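The weighted distances in Table 2 differ from those in Table 1 only through the weights w_{π0(i)} attached via the modal ranking. As an illustration of our own (the function names and example values are not from the paper), two of them can be coded as follows.

```python
import numpy as np

def weighted_kendall(pi, sigma, pi0, w):
    """T_w from Table 2: each discordant pair (i, j) contributes w_{pi0(i)} * w_{pi0(j)}."""
    wi = w[pi0 - 1]          # weight attached to each item through its rank in pi0
    k = len(pi0)
    return sum(wi[i] * wi[j]
               for i in range(k) for j in range(i + 1, k)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def weighted_footrule(pi, sigma, pi0, w):
    """F_w from Table 2: weighted sum of absolute rank differences."""
    return np.sum(w[pi0 - 1] * np.abs(pi - sigma))

pi0 = np.array([1, 2, 3, 4])
w = np.array([2.0, 1.5, 0.2, 0.2])
# Swapping the items ranked first and second in pi0 costs far more than
# swapping the items ranked third and fourth.
print(weighted_footrule(np.array([2, 1, 3, 4]), pi0, pi0, w))  # 3.5
print(weighted_footrule(np.array([1, 2, 4, 3]), pi0, pi0, w))  # 0.4
```

With w = (2, 1.5, 0.2, 0.2), a mistake at the top of the modal ranking is penalized almost nine times as heavily as a mistake at the bottom.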

In addition to the properties of a distance function explained in Section 2.1, another well-studied property of a distance function is the triangle inequality, i.e. d(π_a, π_c) ≤ d(π_a, π_b) + d(π_b, π_c). A distance that satisfies the triangle inequality is called a metric. Some of the metrics used to measure the distance between two rankings are T, R and F. The weighted versions of these metrics, i.e. T_w, R_w and F_w, also satisfy the triangle inequality, and hence they are metrics too; the proof is given in Appendix A. Metrics may have an advantage over non-metric distances in modeling ranking data, as the rankings can be visualized as points in a metric space and the length between any two points determines the metric distance between the two associated rankings.

Applying a weighted distance measure d_w to the distance-based model, the probability of observing a ranking π under the weighted distance-based ranking model is

P(π | w, π0) = exp{−d_w(π, π0 | π0)} / C(w),

where C(w) is the proportionality constant. Generally speaking, if w_i is large, few people will tend to disagree that the item ranked i in π0 should be ranked i, because such disagreement would greatly increase the distance and hence the probability of observing it would become very small. If w_i is close to zero, people have little or no preference on how the item ranked i in π0 is ranked, because a change of its rank will not affect the distance at all.

An illustration of the effects of w is given in Table 3. For weighted distance-based models with four items, five scenarios are considered, and the corresponding classification rates of the correctly ranked positions based on the true ranking probabilities (with respect to π0 = (1, 2, 3, 4)) are given for the different items. Items with weight 2 are classified to their rank positions correctly with high probabilities, whereas items with weight 0.01 are classified correctly to their rank positions with probabilities comparable to random guesses.

Table 3
Illustration of the effects of w: correct classification rates of the ranked positions of items 1–4 under five settings of the model parameters (w1, w2, w3, w4).

Very often, the modal ranking π0 is unknown. If the researchers have a clear a priori idea of how the ranks in π0 should be weighted, they can simply estimate π0 from the data. Otherwise, by estimating π0, the weights w are implicitly estimated as well, and therefore the distance d_w(π, π0 | π0) changes through the computations; in other words, the weighted distance is essentially driven by the data. This may produce a more flexible weighted distance-based model than the (equally weighted) distance-based model.
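Since C(w) is a sum over all k! rankings, the model probabilities can be evaluated exactly when k is small. The sketch below is our own illustration with a brute-force normalizing constant; the weighted footrule is used as the distance, and the function names are ours.

```python
import numpy as np
from itertools import permutations

def weighted_footrule(pi, sigma, pi0, w):
    """F_w: weight w[r - 1] attaches to the item ranked r in the modal ranking pi0."""
    return np.sum(w[pi0 - 1] * np.abs(pi - sigma))

def weighted_model_prob(pi, w, pi0, distance=weighted_footrule):
    """P(pi | w, pi0) = exp(-d_w(pi, pi0 | pi0)) / C(w), with C(w) obtained by
    summing over all k! rankings (feasible only for small k)."""
    k = len(pi0)
    perms = [np.array(p) for p in permutations(range(1, k + 1))]
    C = sum(np.exp(-distance(s, pi0, pi0, w)) for s in perms)
    return np.exp(-distance(np.asarray(pi), pi0, pi0, w)) / C

pi0 = np.array([1, 2, 3, 4])
w = np.array([2.0, 1.5, 0.2, 0.2])
# With all weights equal, d_w is proportional to the unweighted distance, so the
# model reduces to the distance-based model of Section 2.1 (up to a rescaled dispersion).
print(weighted_model_prob(pi0, w, pi0))                     # modal ranking: largest probability
print(weighted_model_prob(np.array([2, 1, 3, 4]), w, pi0))  # swapping the top two items: much smaller
```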

3. Properties of weighted distance-based models

3.1. Relationship between weighted distance-based models and (k−1)-parameter models

Fligner and Verducci (1986) showed that Kendall's tau can be decomposed into (k−1) independent metrics:

T(π, π0) = Σ_{π0(i)=1}^{k−1} V_{π0(i)},    where    V_{π0(i)} = Σ_{π0(j)=π0(i)+1}^{k} I{[π(i) − π(j)][π0(i) − π0(j)] < 0}.

Among the four distance functions stated in Section 2.1, it is found that only Kendall's tau has such a multistage representation. By applying a dispersion parameter λ_i to stage V_i, the Mallows φ-model is extended to

P(π | λ, π0) = exp{−Σ_{π0(i)=1}^{k−1} λ_{π0(i)} V_{π0(i)}} / C(λ),

where λ = {λ_i, i = 1, ..., k−1} and C(λ) is the proportionality constant. These models were named (k−1)-parameter models in Fligner and Verducci (1986). The Mallows φ-models are special cases of the φ-component models when λ1 = ⋯ = λ_{k−1}. The (k−1)-parameter models belong to the class of multistage models (Fligner and Verducci, 1988), but do not belong to the class of distance-based models, as the symmetry property of a distance is lost. Notice here that the so-called distance between rankings π and σ in the (k−1)-parameter models can be expressed as

Σ_{π0(i)<π0(j)} λ_{π0(i)} I{[π(i) − π(j)][σ(i) − σ(j)] < 0},

which is obviously not symmetric, and hence is not a proper distance measure.

The weighted tau models proposed in Lee and Yu (2010) belong to the class of distance-based models, because the weighted tau function is a proper distance function. Furthermore, the weighted tau models retain the multistage nature of the (k−1)-parameter models. Under the weighted tau models, the ranking can be decomposed into (k−1) stages V*_1, ..., V*_{k−1}:

T_w(π, π0 | π0) = Σ_{π0(i)=1}^{k−1} w_{π0(i)} V*_{π0(i)},    where    V*_{π0(i)} = Σ_{π0(j)=π0(i)+1}^{k} w_{π0(j)} I{[π(i) − π(j)][π0(i) − π0(j)] < 0}.

In principle, if the distance used in a distance-based ranking model does not satisfy the symmetry property, it is still a ranking model, but it does not belong to the class of distance-based ranking models. In addition to the symmetry property of a distance, Hennig and Hausdorf (2006) commented that the invariance property should be a concern as well.
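The stage representation of T_w can be made concrete with a short computation. The sketch below is ours (the names are illustrative only): it accumulates the stage contributions w_{π0(i)} V*_{π0(i)}, whose sum reproduces the weighted Kendall's tau distance from π0.

```python
import numpy as np

def weighted_kendall_stages(pi, pi0, w):
    """Stage contributions w_{pi0(i)} * V*_{pi0(i)} for the items ranked 1, ..., k-1 in pi0."""
    k = len(pi0)
    order = np.argsort(pi0)               # items listed in the order of their ranks in pi0
    contrib = np.zeros(k - 1)
    for a in range(k - 1):                # stage for the item ranked a + 1 in pi0
        i = order[a]
        V = 0.0
        for b in range(a + 1, k):         # items ranked below it in pi0
            j = order[b]
            if pi[i] > pi[j]:             # pair (i, j) is discordant with pi0
                V += w[b]                 # weight of the lower-ranked item in pi0
        contrib[a] = w[a] * V
    return contrib

pi0 = np.array([1, 2, 3, 4])
w = np.array([2.0, 1.5, 0.2, 0.2])
stages = weighted_kendall_stages(np.array([2, 1, 3, 4]), pi0, w)
print(stages, stages.sum())               # the sum equals T_w(pi, pi0 | pi0)
```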
3.2. Other properties of weighted distance-based models

As defined in Critchlow et al. (1991), some properties for ranking models are (1) label-invariance, (2) reversibility, (3) L-decomposability, (4) strong unimodality and (5) complete consensus. The definitions of these properties are given in Appendix B. It is natural to see that property (1) is essential to all statistical models for ranking data, and weighted distance-based models with distance types T_w, R_w, R²_w and F_w all satisfy (1). However, some models do not satisfy properties (2)–(5). In particular, all our proposed weighted distance-based models satisfy property (2) (this can be easily verified), and only the models with the weighted distances R²_w and F_w satisfy (3) (the proof is given in Appendix C). None of our proposed models satisfies properties (4) and (5) unless all the weights are the same, i.e. the distances are unweighted. However, rankings that violate properties (4) and (5) are commonly seen.

Consider the song dataset from Critchlow et al. (1991), in which 98 students were asked to rank 5 words, (1) score, (2) instrument, (3) solo, (4) benediction and (5) suit, according to their association with the word "song". Only part of the data is given in Critchlow et al. (1991), and we fit an F model, an F_w model and a (k−1)-parameter model for comparison. The details are given in Table 4. The F_w model gives the best fit, as it has the lowest BIC value. In particular, the F_w model gives the best fit to one observed ranking to which the F and (k−1)-parameter models give relatively poor fits. One possible reason is that the data seem to violate property (4) (and hence property (5)).

Note that all models gave the same modal ranking, in which item 1 is less preferred than item 2. It is interesting to examine a pair of observed rankings that differ only by an adjacent transposition of items 1 and 2. Under the strong unimodality property, the ranking that places item 1 ahead of item 2 should have the smaller probability. However, the observed frequencies seem to violate this property, and such violations also occur for many other ranking pairs in the data. As the weighted distance-based model F_w does not satisfy property (4), it can give a better fit to the data than the models satisfying property (4).

Through the extension using weighted distances, our proposed weighted distance-based models provide a wider class of ranking models. They may give a better fit to the data when properties (4) and/or (5) are not satisfied. If a transposition of two particular items in π0 yields a large drop in probability compared with P(π0), while a transposition of another two items does not lead to a large reduction in probability, the classical distance-based model will give a poor fit but the weighted distance-based model can give a better fit. This flexibility is illustrated in Table 4: two of the observed rankings both have a footrule distance of 4 from the modal ranking, yet their observed frequencies are quite different. It is clear that the F_w model is more flexible than the F model.

Table 4
Details of the F, F_w and (k−1)-parameter models for the song dataset: observed and expected frequencies by ordering (with an "Others" row; total observed frequency 83) and BIC values.

4. Mixtures of weighted distance-based models

4.1. Mixture modeling framework

If a population contains G sub-populations with probability mass functions (pmf) P_g(x), and the proportion of sub-population g equals p_g, the pmf of the mixture model is

P(x) = Σ_{g=1}^{G} p_g P_g(x).

Hence, the probability of observing a ranking π under a mixture of G weighted distance-based ranking models is

P(π) = Σ_{g=1}^{G} p_g P(π | w_g, π0g) = Σ_{g=1}^{G} p_g exp{−d_{w_g}(π, π0g | π0g)} / C(w_g),

and the loglikelihood for n observations is

ℓ = Σ_{i=1}^{n} log [ Σ_{g=1}^{G} p_g exp{−d_{w_g}(π_i, π0g | π0g)} / C(w_g) ],

where w_g and π0g are the parameters of the weighted distance-based model of sub-population g. A noise component, with all weights equal to zero, is sometimes included to represent a sub-population with completely uncertain rank-order preferences. This could happen in situations where the population is very large, and adding a noise component could help single out the sub-population with random preferences.

Estimating the model parameters by direct maximization of the loglikelihood function may lead to a high-dimensional numerical optimization problem. Instead, maximization can be achieved by applying the EM algorithm (Dempster et al., 1977). In short, the E-step of an EM algorithm computes, for all observations, the probabilities of belonging to every sub-population, and the M-step maximizes the conditional expected complete-data loglikelihood given the estimates generated in the E-step. To derive the EM algorithm, we define a latent variable z_i = (z_1i, ..., z_Gi), with z_gi = 1 if observation i belongs to sub-population g and z_gi = 0 otherwise. The complete-data loglikelihood is

L_com = Σ_{i=1}^{n} Σ_{g=1}^{G} z_gi [log(p_g) − d_{w_g}(π_i, π0g | π0g) − log C(w_g)].

First of all, we choose the initial parameters for w_g, π0g and p_g. Then we alternately run the E-step and the M-step until the parameters converge. In the E-step, ẑ_gi, g = 1, 2, ..., G, are updated for observations i = 1, 2, ..., n by

ẑ_gi = p̂_g P(π_i | ŵ_g, π̂0g) / Σ_{h=1}^{G} p̂_h P(π_i | ŵ_h, π̂0h).
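To make the E-step concrete, the sketch below (ours, not the authors' implementation) computes the posterior membership probabilities ẑ_gi for a mixture of weighted footrule models, with the normalizing constants obtained by brute force for small k; all names and example values are illustrative.

```python
import numpy as np
from itertools import permutations

def fw_distance(pi, sigma, pi0, w):
    """Weighted footrule F_w (Table 2)."""
    return np.sum(w[pi0 - 1] * np.abs(pi - sigma))

def fw_prob(pi, w, pi0):
    """P(pi | w, pi0) for the weighted footrule model, brute-force C(w)."""
    k = len(pi0)
    perms = [np.array(p) for p in permutations(range(1, k + 1))]
    C = sum(np.exp(-fw_distance(s, pi0, pi0, w)) for s in perms)
    return np.exp(-fw_distance(pi, pi0, pi0, w)) / C

def e_step(rankings, p, ws, pi0s):
    """Posterior probability z_hat[i, g] that ranking i came from component g."""
    n, G = len(rankings), len(p)
    z_hat = np.zeros((n, G))
    for i, pi in enumerate(rankings):
        for g in range(G):
            z_hat[i, g] = p[g] * fw_prob(pi, ws[g], pi0s[g])
        z_hat[i] /= z_hat[i].sum()
    return z_hat

rankings = [np.array([1, 2, 3, 4]), np.array([4, 3, 2, 1])]
p = np.array([0.5, 0.5])
ws = [np.array([2.0, 1.5, 0.2, 0.2]), np.array([1.0, 1.0, 1.0, 1.0])]
pi0s = [np.array([1, 2, 3, 4]), np.array([4, 3, 2, 1])]
print(e_step(rankings, p, ws, pi0s))
```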

In the M-step, the model parameters are updated by maximizing the complete-data loglikelihood with z_gi replaced by ẑ_gi. The MLEs π̂0g and ŵ_g are obtained simultaneously (Fligner and Verducci, 1986). For a given g = 1, ..., G, π̂0g is obtained by an exhaustive search algorithm, and then ŵ_g is obtained by solving the following equation (Murphy and Martin, 2003, p. 648, Eq. (5)):

Σ_{i=1}^{n} ẑ_gi d_{w_g}(π_i, π̂0g | π̂0g) / Σ_{i=1}^{n} ẑ_gi = Σ_{j=1}^{k!} P(π_j | w_g, π̂0g) d_{w_g}(π_j, π̂0g | π̂0g).

Using the latest weights, π̂0g is recomputed. The model fitting procedure stops when π̂0g does not change anymore. Based on our limited experience, we found that the parameter estimates are not sensitive to the initialization. Therefore, in this paper, random numbers drawn from uniform(0, 1), 1/G, and the ranking sorted according to the mean ranks were used as initial values for w, p and π0 respectively. The EM algorithm converged within 20 iterations in most cases of our simulation studies and applications.

There are two major difficulties in fitting weighted distance-based models when k is large. First, the global search algorithm for the MLE π̂0g is not practical because the number of possible choices is too large. Instead, as suggested in Busse et al. (2007), a local search algorithm should be used. They suggested computing the sum of distances Σ_{i=1}^{n} z_gi d_{w_g}(π_i, π̂0g | π̂0g) for all π̂0g ∈ Π, where Π is a set containing all rankings having a Cayley distance of 0 or 1 from the initial ranking; a reasonable choice of initial ranking can be constructed using the mean ranks. Second, the numerical computation of the proportionality constant C(w_g) is time consuming: C(w_g) is the summation of exp{−d_{w_g}(π, π0g | π0g)} over all possible π, and its computational time increases exponentially with the number of items k. For small k, the proportionality constant can be computed efficiently by summing over all rankings. For large k, Lebanon and Lafferty (2002) proposed an MCMC algorithm for fitting (unweighted) distance-based models, and the simulation study in Klementiev et al. (2008) showed that the performance of this estimation technique is acceptable for k = 10. Similar methods can be extended to the weighted distance-based models.

To determine the number of components in the mixture, we use the Bayesian information criterion (BIC). The BIC equals −2ℓ + v log(n), where ℓ is the maximized loglikelihood, n is the sample size and v is the number of model parameters. The model with the smallest BIC is chosen as the best model. Murphy and Martin (2003) showed that the BIC works quite well if there is no noise component in the mixed population. Furthermore, Biernacki et al. (2006) showed that, for a large family of mixtures (with different numbers of mixing components and different underlying densities for the various components), the BIC criterion is consistent, and the BIC has been shown to be efficient on practical grounds.
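A minimal model-selection loop over the number of components could then look as follows. This is our sketch only; fit_mixture is a hypothetical stand-in for an EM fit that returns the maximized log-likelihood and the number of free parameters of a G-component mixture.

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """BIC = -2 * maximized log-likelihood + v * log(n); smaller is better."""
    return -2.0 * loglik + n_params * np.log(n_obs)

def select_num_components(rankings, fit_mixture, max_G=5):
    """Fit mixtures with G = 1, ..., max_G components and keep the smallest BIC.
    fit_mixture(rankings, G) is a placeholder for an EM fit returning
    (maximized log-likelihood, number of free parameters)."""
    best = None
    for G in range(1, max_G + 1):
        loglik, n_params = fit_mixture(rankings, G)
        score = bic(loglik, n_params, len(rankings))
        if best is None or score < best[1]:
            best = (G, score)
    return best  # (selected G, its BIC)
```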
4.2. Assessment of the goodness-of-fit

It has been argued that log-likelihood values, and criteria derived from them, are not directly comparable for essentially different models with different implicit flexibility (Amemiya, 1985; Cox and Hinkley, 1974). Given this concern, we also rely on a goodness-of-fit measure to assess model performance. To assess the goodness-of-fit of the model, we use the sum of squared Pearson residuals (χ²) suggested by Lee and Yu (2010):

χ² = Σ_{i=1}^{k!} r_i²,    where    r_i = (O_i − E_i) / √E_i

is the Pearson residual, and O_i and E_i are the observed and expected frequencies of ranking i respectively. However, if some of the E_i are smaller than 5, the computed chi-square statistic will be biased. We are likely to encounter this problem when the dataset is small and k is large. In this case, we suggest using the truncated sum of squared Pearson residuals criterion described in Erosheva et al. (2007).
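The goodness-of-fit measure is straightforward to compute once the expected frequencies are available. The sketch below is ours; the optional min_expected argument drops cells with small expected counts, which is one possible truncation in the spirit of Erosheva et al. (2007) (the exact truncated criterion used there may differ), and the example frequencies are hypothetical.

```python
import numpy as np

def pearson_chi_square(observed, expected, min_expected=None):
    """Sum of squared Pearson residuals over the k! rankings; if min_expected is
    given, cells with expected frequency below it are dropped (truncated version)."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    if min_expected is not None:
        keep = expected >= min_expected
        observed, expected = observed[keep], expected[keep]
    return np.sum((observed - expected) ** 2 / expected)

# Hypothetical observed and expected frequencies for the 24 rankings of 4 items.
obs = np.array([40, 12, 8, 5, 3, 2, 10, 6, 4, 3, 2, 1,
                30, 9, 5, 4, 2, 1, 6, 3, 2, 1, 1, 0])
exp = obs * 0.9 + 1.0
print(pearson_chi_square(obs, exp))
print(pearson_chi_square(obs, exp, min_expected=5))
```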

5. Simulation studies

In this section, three simulation results are reported. The first simulation studies the performance of the estimation algorithm for our weighted distance-based models, the second investigates the effectiveness of using the BIC to select the number of components, and the third compares the performance of the T_w model and the (k−1)-parameter model, both derived from the T model.

In the first simulation, datasets of rankings of 4 items, each with a sample size of 2000, were simulated to study the accuracy of model fitting. We considered 4 models (their parameters are listed in Tables 5 and 6) based on the weighted Kendall's tau distance. The modal rankings were the same in Models 1 and 2, but the dispersion in Model 1 was comparatively larger. The modal rankings were also the same in Models 3 and 4, and again the dispersion in Model 3 (second component) was comparatively larger. The initial values for ŵ were randomly drawn from uniform(0, 1). The simulation results, based on 50 replications, are summarized in Tables 7 and 8, with empirical standard deviations of the parameter estimates provided in parentheses.

Table 5
Simulation settings of Models 1 and 2 (modal ranking π0 and weights w1–w4).

Table 6
Simulation settings of Models 3 and 4 (mixing proportion p, modal ranking π0 and weights w1–w4).

Table 7
First simulation results: parameter estimates for Models 1 and 2, with empirical standard deviations in parentheses.

Table 8
First simulation results (cont'd): parameter estimates for Models 3 and 4, with empirical standard deviations in parentheses.

There are two observations from these simulation results. First, the model estimates are very close to their actual values, indicating that our proposed algorithm works well for fitting mixtures of weighted distance-based models for ranking data. Second, the estimates are more accurate for models with larger weights, as these estimates have smaller standard deviations. This is because, for models with large weights, observations tend to be concentrated around the modal ranking, and hence the probability distribution is far from a uniform distribution, under which the standard error of the central tendency is expected to be the largest.

In the second simulation, we used the four models described in the first simulation. Mixtures of weighted distance-based models with G = 1, 2, 3 mixing components were fitted. We repeated this process 50 times and recorded the frequencies with which each mixture model was selected as the best according to the BIC. The results are shown in Table 9. The +N notation indicates an additional noise component (uniform model), i.e. w = 0; a model q+N is a mixture with q+1 sub-populations, one of which is a noise component. The simulation results show that the BIC can identify the number of components correctly, and the performance improves for models with larger weights, since they have more observations concentrated at the modal ranking. However, the BIC sometimes suggests including an additional noise component, probably because there is only one parameter in the noise component, and hence the improvement in loglikelihood is less heavily penalized.

Table 9
Second simulation results: frequencies of the mixture models selected using the BIC.

In the third simulation, we compare the performance of the T_w model and the (k−1)-parameter model in terms of flexibility. Two settings were simulated, one being a (k−1)-parameter model with parameters (1, 0.75, 0.5) and the other being a T_w model with parameters (2, 1.5, 0.2, 0.2). The simulation was replicated 50 times, and each time a (k−1)-parameter model and a weighted tau model were fitted to the data. Figs. 1 and 2 show that, when the underlying model is a (k−1)-parameter model, the performance of the T_w model (in terms of both the BIC and the sum of squared Pearson residuals) was very close to that of the (k−1)-parameter model. On the other hand, when the underlying model is a T_w model, the T_w model outperformed the (k−1)-parameter model. This simulation shows that the T_w model is more flexible than the (k−1)-parameter model.
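The paper does not spell out how the simulated datasets were generated, but for small k one can draw rankings from a weighted distance-based model exactly by enumerating all k! rankings. The sketch below is our own illustration for the weighted Kendall's tau model, with illustrative parameter values.

```python
import numpy as np
from itertools import permutations

def weighted_kendall(pi, sigma, pi0, w):
    """T_w from Table 2."""
    wi = w[pi0 - 1]
    k = len(pi0)
    return sum(wi[i] * wi[j]
               for i in range(k) for j in range(i + 1, k)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def sample_rankings(n, pi0, w, distance, seed=None):
    """Draw n rankings from P(pi | w, pi0) by enumerating all k! rankings."""
    rng = np.random.default_rng(seed)
    k = len(pi0)
    perms = [np.array(p) for p in permutations(range(1, k + 1))]
    probs = np.array([np.exp(-distance(p, pi0, pi0, w)) for p in perms])
    probs /= probs.sum()
    idx = rng.choice(len(perms), size=n, p=probs)
    return [perms[i] for i in idx]

data = sample_rankings(2000, np.array([1, 2, 3, 4]),
                       np.array([2.0, 1.5, 0.2, 0.2]), weighted_kendall, seed=0)
print(len(data), data[0])
```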

Fig. 1. Box plot of the BIC over 50 replications. The two ranking datasets on the left are generated from a (k−1)-parameter model and the other two on the right are generated from a T_w model.

Fig. 2. Box plot of the sum of squared Pearson residuals over 50 replications. The two ranking datasets on the left are generated from a (k−1)-parameter model and the other two on the right are generated from a T_w model.

6. Applications

We apply our mixture models to six ranking datasets. The first five are synthetic datasets from Cheng et al. (2009), who turned some multi-class and regression datasets from the UC Irvine repository and the Statlog collection into ranking datasets using a naive Bayes classifier approach. The last dataset is a real-world ranking dataset from a political study described in Croon (1989).

6.1. Application to synthetic data

We consider the synthetic ranking datasets with 4–5 items studied in Cheng et al. (2009). Information on the five datasets is given in Table 10. We compare the performance of our proposed mixtures of weighted distance-based models (T_w, R_w, R²_w and F_w) with the existing mixtures of distance-based models (Murphy and Martin, 2003). Besides mixture models with heterogeneous π_g and w_g, we could also consider models with heterogeneity only in π_g but a constant w across all mixtures; we denote such models with G mixtures as G throughout the paper. The results are shown in Table 11; the BIC values in bold type represent the best model for each dataset. It is clear that, for all types of distance measures, the BIC values for the weighted distance-based models are always smaller than those of their unweighted counterparts. Furthermore, although not reported here, the weighted distance-based models are always better than their unweighted counterparts in terms of BIC for numbers of components between 1 and 5.

Therefore, we can conclude that our weighted distance-based models provide a better fit than the (unweighted) distance-based models.

Table 10
Information on the five synthetic ranking datasets (Authorship, Calhousing, Cpu-small, Stock and Vehicle): sample size and number of items.

Table 11
BIC and number of components of the unweighted models (T, R, R², F and the (k−1)-parameter model) and the weighted models (T_w, R_w, R²_w and F_w) for each of the five datasets.

6.2. Application to real data: social science research on political goals

To illustrate the applicability of the weighted distance-based models described in Section 3, we study the ranking dataset obtained from Croon (1989). It consists of 2262 rankings of four political goals for the Government, collected in a survey conducted in Germany. The four goals were: (A) maintain order in the nation, (B) give people more say in Government decisions, (C) fight rising prices, and (D) protect freedom of speech. The respondents were classified into three value-priority groups according to their top two choices (Inglehart, 1977). Materialists correspond to individuals who gave priority to (A) and (C) regardless of the ordering, whereas those who chose (B) and (D) were classified as post-materialists. The last category comprised respondents giving all the other combinations of rankings, and they were classified as holding mixed value orientations.

Weighted distance-based models were fitted for the four types of weighted distances (T_w, R_w, R²_w and F_w), with numbers of mixing components G = 1, 1+N, ..., 3+N and 4. The BIC values are listed in Table 12; the underlined BIC values represent the best number of components within each distance type, and the BIC value in bold type represents the best mixture model. We find that the best model is the weighted footrule with G = 3. Its BIC is better (smaller) than those of the strict utility (SU) model (parameter estimates provided in Table 14; interested readers are referred to Croon (1989) for the interpretation of the SU model) and the Pendergrass–Bradley (PB) model discussed in Croon (1989).

Table 12
BIC of the mixture models for each distance type (T, R, R², F, the (k−1)-parameter model, and the weighted distances T_w, R_w, R²_w, F_w) and each number of mixing components (1, 1+N, ..., 3+N, 4).

Table 13
Parameters of the weighted footrule mixture model: for each group, the ordering of the goals in π0 (group 1: C A B D; group 2: A C B D; group 3: B D C A), the mixing proportion p and the weights w1–w4.

Table 14
Parameters of the SU mixture model (an item with a larger parameter is more preferred): for each group, the mixing proportion p and the parameters of items A–D.

It is undoubtedly better than the best (unweighted) distance-based model, which is based on the footrule distance. For all types of distances, both unweighted and weighted, the lowest BIC appears at G = 3 or 3+N. The parameter estimates of the best model, a mixture of three weighted footrule models, are shown in Table 13. The first two groups, which comprised 79% of the respondents, were materialists, as they ranked (A) and (C) as more important than the other two goals. The third group was post-materialist, as people in this group ranked (B) and (D) as more important. Based on our grouping, Inglehart's theory is not entirely appropriate for Germany: we should at least distinguish two types of materialists, one ranking (A) higher than (C) and the other ranking (C) higher than (A). This conclusion is similar to the findings in Croon (1989) and Moors and Vermunt (2007).

However, the mixture solution we obtained here is slightly different from the SU mixture solution of Croon. This can be seen by visualizing the data via a truncated octahedron (Thompson, 1993). This visualization technique gives a better understanding of the ranking distribution: all ranking frequencies can be presented in such a way that similar rankings are closer together on the truncated octahedron. An illustration of the truncated octahedron is shown in Fig. 3. The 24 rankings are placed on the vertices in such a way that each edge represents an adjacent transposition. Among the six hexagonal faces, there are four on which all vertices have the same top choice; for example, the hexagonal face facing the reader represents the six rankings with top choice C. The dot size is proportional to the ranking frequencies, and rankings that constitute more than 5% of the corresponding group are dotted.

Figs. 4 and 5 show the predicted distributions of the F_w and SU mixture models respectively. It can be seen that the three mixtures produced using the F_w distance are more separated, as the difference between groups 1 and 2 is much clearer in the F_w mixture model than in the SU mixture model. Detailed frequency tables are provided in Tables 15 and 16. For groups 1 and 2, the weights w3 and w4 are close to zero while w1 and w2 are much larger, indicating that observations from groups 1 and 2 are mainly of the forms C A ? ? and A C ? ? respectively. Compared with those in groups 1 and 2, the weights in group 3 are relatively closer to zero, which implies that people belonging to this group were less certain about their preferences than people in the other groups. The weight of item A is the largest in group 3, which means that A has a relatively high probability of being ranked last; this can be seen in Table 16 as well. Although both models suggest a three-mixture solution and their χ² values are very close, the constituents of the three mixtures are quite different. The estimated proportions of groups 1 and 2 differ between the two solutions; compared with the SU mixture solution, our solution has a higher estimated proportion for group 2 (0.441).
This difference is mainly due to the difference in assigning the rankings A C B D and A C D B to group 2. At first glance, these two rankings should be assigned to group 2. Referring to Tables 15 and 16, our mixture model assigns approximately 96% of these two rankings to group 2, while for the SU mixture model the percentage drops to 63%. The grouping of our weighted footrule mixture model therefore appears more reasonable than that of the SU mixture model.

Table 15
Observed and expected frequencies of the SU mixture model for the 24 orderings (A B C D through D C B A). Mixtures 1, 2 and 3 have central rankings C A B D, A C B D and B D C A, respectively.

Table 16
Observed and expected frequencies of the F_w mixture model for the 24 orderings. Although both models suggest a three-mixture solution and their χ² values are very close, the constituents of the three mixtures are quite different.

Fig. 3. A truncated octahedron representing the rankings of 4 items.

Fig. 4. The truncated octahedron representation of the F_w mixture model. The three truncated octahedrons represent mixtures having central rankings C A B D, A C B D and B D C A, respectively. Rankings with frequency greater than 5% of the total mixture size are plotted. Compared with Fig. 3, it is clear that the F_w mixtures are much purer.

Fig. 5. The truncated octahedron representation of the SU mixture model.

7. Conclusion

7.1. Concluding remarks

We proposed a new class of distance-based models using weighted distance measures for ranking data. The models assume that the population is formed by the aggregation of some homogeneous sub-populations, and that the rankings observed in each sub-population follow a weighted distance-based model. The weighted distance-based ranking models proposed in this paper keep the nature of a distance and, at the same time, attain greater flexibility. Properties of the weighted distances were studied, in particular the relationship between the weighted tau and the distance used in the (k−1)-parameter models. Although our proposed weighted distance-based models do not satisfy some of the statistical properties of ranking models, we found that they indeed have advantages. Simulation results showed that the algorithm can accurately estimate the model parameters and that the BIC is appropriate for model selection. Applications to both synthetic data and real data showed that our extended weighted models fit the data better than their corresponding equally weighted counterparts.

Furthermore, the interpretation of the model remains simple and straightforward. The mixtures of weighted distance-based models are not without limitations. For ranking data with many items, modeling weights on the less-preferred items may not be necessary, as the judges may be indifferent among them. Also, the computational burden increases exponentially as the number of items becomes larger.

7.2. Further research directions

An area of future research could be the development of the proposed mixtures of weighted distance-based models for partially ranked data. A partial ranking π occurs when only q of the k items (q < k − 1) are ranked. Partial rankings are commonly seen in survey studies, where the number of items is very large and/or only the top few preferred items are of particular interest. Extending weighted distance-based models to partially ranked data would greatly widen their applicability. The (k−1)-parameter model was recently extended to partially ranked data by Melia and Bao (2010), and the extension may be applied to weighted distance-based models. It is also of interest to develop computationally efficient algorithms for computing the proportionality constant in the weighted distance-based models for rankings of a large number of items.

Acknowledgments

The research of Philip L.H. Yu was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. HKU 7473/05H). We thank the associate editor and three anonymous referees for their helpful suggestions for improving this article.

Appendix A. Proof of the triangle inequality for T_w, R_w and F_w

Theorem 1. T_w, R_w and F_w satisfy the triangle inequality.

Proof. The weighted tau T_w counts the weighted disagreements over every possible item pair in the two rankings. Without loss of generality, assume π_a(i) < π_a(j). For this particular item pair (i, j) there are four possibilities: π_b(i) < π_b(j) and π_c(i) < π_c(j); π_b(i) < π_b(j) and π_c(i) > π_c(j); π_b(i) > π_b(j) and π_c(i) < π_c(j); and π_b(i) > π_b(j) and π_c(i) > π_c(j). It can easily be confirmed that, under all four possibilities, the contribution of the pair to the left-hand side of the inequality is less than or equal to its contribution to the right-hand side.

The weighted Spearman's rho R_w can be expressed as

R_w(π_a, π_b) = [Σ_{i=1}^{k} w_{π0(i)} (π_a(i) − π_b(i))²]^{1/2} = [Σ_{i=1}^{k} (u_a(i) − u_b(i))²]^{1/2},

where u_a(i) = √(w_{π0(i)}) π_a(i) and u_b(i) = √(w_{π0(i)}) π_b(i). As R_w has the form of a Euclidean distance, it satisfies the triangle inequality.

The weighted footrule F_w can be decomposed as F_w = Σ_{i=1}^{k} w_{π0(i)} F_i, where every F_i = |π_a(i) − π_b(i)| satisfies the triangle inequality. We can conclude that F_w satisfies the triangle inequality.
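Theorem 1 can also be checked numerically for any particular weight vector. The brute-force check below is our own illustration: it verifies the triangle inequality for T_w over all triples of rankings of four items.

```python
import numpy as np
from itertools import permutations

def weighted_kendall(pi, sigma, pi0, w):
    """T_w from Table 2."""
    wi = w[pi0 - 1]
    k = len(pi0)
    return sum(wi[i] * wi[j]
               for i in range(k) for j in range(i + 1, k)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

pi0 = np.array([1, 2, 3, 4])
w = np.array([2.0, 1.5, 0.2, 0.2])
perms = [np.array(p) for p in permutations(range(1, 5))]
ok = all(weighted_kendall(a, c, pi0, w)
         <= weighted_kendall(a, b, pi0, w) + weighted_kendall(b, c, pi0, w) + 1e-12
         for a in perms for b in perms for c in perms)
print(ok)  # True: the triangle inequality holds for T_w on these rankings
```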

Appendix B. Definitions of properties of ranking models

(1) Label-invariance. Relabeling the items has no effect on the probability model.

(2) Reversibility. The reverse function γ(π) for a ranking of k items is defined by γ(i) = k + 1 − i. Reversing the ranking π has no effect on the probability model.

(3) L-decomposability. The ranking of k items can be decomposed into k − 1 stages. At stage i, where i = 1, 2, ..., k − 1, the best among the items remaining at that stage is selected, and this item is then removed in the following stages.

(4) Strong unimodality (weak transposition property). A transposition function τ_ij is defined by τ(i) = j, τ(j) = i and τ(m) = m for all m ≠ i, j. With modal ranking π0, for every pair of items i and j such that π0(i) < π0(j) and every π such that π(i) = π(j) − 1, P(π) ≥ P(π ∘ τ_ij), with the maximum attained at π = π0. This guarantees that the probability is non-increasing as π moves one step away from π0, for items having adjacent ranks.

(5) Complete consensus (transposition property). Compared with strong unimodality, complete consensus is an even stronger property, which guarantees that for every pair of items (i, j) such that π0(i) < π0(j) and every π such that π(i) < π(j), P(π) ≥ P(π ∘ τ_ij). From this definition, we can see that complete consensus implies strong unimodality.

Appendix C. Properties of weighted distance-based models

Theorem 2. R²_w and F_w are L-decomposable.

Proof. Critchlow et al. (1991) showed that a ranking model is L-decomposable if there exist functions f_r, r = 1, ..., k, such that

d(π, e) = Σ_{r=1}^{k} f_r[π^{-1}(r)].

Therefore, R²_w and F_w are L-decomposable, as

R²_w(π, e) = Σ_{r=1}^{k} w_{π0(π^{-1}(r))} [r − π^{-1}(r)]²    and    F_w(π, e) = Σ_{r=1}^{k} w_{π0(π^{-1}(r))} |r − π^{-1}(r)|.

References

Amemiya, T., 1985. Advanced Econometrics. Harvard University Press.
Biernacki, C., Celeux, G., Govaert, G., Langrognet, F., 2006. Model-based cluster and discriminant analysis with the mixmod software. Computational Statistics and Data Analysis 51.
Busse, L.M., Orbanz, P., Buhmann, J.M., 2007. Cluster analysis of heterogeneous rank data. In: Proceedings of the 24th International Conference on Machine Learning.
Cheng, W., Huhn, J., Hullermeier, E., 2009. Decision tree and instance-based learning for label ranking. In: Proceedings of the 26th International Conference on Machine Learning.
Cox, D.R., Hinkley, D.V., 1974. Theoretical Statistics. Chapman and Hall, London.
Critchlow, D.E., 1985. Metric Methods for Analyzing Partially Ranked Data. Lecture Notes in Statistics, vol. 34. Springer, Berlin.
Critchlow, D.E., Fligner, M.A., Verducci, J.S., 1991. Probability models on rankings. Journal of Mathematical Psychology 35.
Croon, M.A., 1989. Latent class models for the analysis of rankings. In: Soete, G.D., Feger, H., Klauer, K.C. (Eds.), New Developments in Psychological Choice Modeling. Elsevier Science, North-Holland.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39.
Diaconis, P., 1988. Group Representations in Probability and Statistics. Institute of Mathematical Statistics, Hayward.
Erosheva, E.A., Fienberg, S.E., Joutard, C., 2007. Describing disability through individual-level mixture models for multivariate binary data. The Annals of Applied Statistics 1.
Fligner, M.A., Verducci, J.S., 1986. Distance based ranking models. Journal of the Royal Statistical Society: Series B 48.
Fligner, M.A., Verducci, J.S., 1988. Multi-stage ranking models. Journal of the American Statistical Association 83.
Gormley, I.C., Murphy, T.B., 2006. Analysis of Irish third-level college application data. Journal of the Royal Statistical Society: Series A 169.
Gormley, I.C., Murphy, T.B., 2008. Exploring voting blocs within the Irish electorate: a mixture modeling approach. Journal of the American Statistical Association 103.
Hennig, C., Hausdorf, B., 2006. Design of dissimilarity measures: a new dissimilarity measure between species distribution areas. In: Batagelj, V., Bock, H.H., Ferligoj, A., Ziberna, A. (Eds.), Data Science and Classification.
Inglehart, R., 1977. The Silent Revolution: Changing Values and Political Styles Among Western Publics. Princeton University Press, Princeton.
Klementiev, A., Roth, D., Small, K., 2008. Unsupervised rank aggregation with distance-based models. In: Proceedings of the 25th International Conference on Machine Learning.


More information

PIRLS 2016 Achievement Scaling Methodology 1

PIRLS 2016 Achievement Scaling Methodology 1 CHAPTER 11 PIRLS 2016 Achievement Scaling Methodology 1 The PIRLS approach to scaling the achievement data, based on item response theory (IRT) scaling with marginal estimation, was developed originally

More information

Aijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules.

Aijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules. Discretization of Continuous Attributes for Learning Classication Rules Aijun An and Nick Cercone Department of Computer Science, University of Waterloo Waterloo, Ontario N2L 3G1 Canada Abstract. We present

More information

PAIRED COMPARISONS MODELS AND APPLICATIONS. Regina Dittrich Reinhold Hatzinger Walter Katzenbeisser

PAIRED COMPARISONS MODELS AND APPLICATIONS. Regina Dittrich Reinhold Hatzinger Walter Katzenbeisser PAIRED COMPARISONS MODELS AND APPLICATIONS Regina Dittrich Reinhold Hatzinger Walter Katzenbeisser PAIRED COMPARISONS (Dittrich, Hatzinger, Katzenbeisser) WU Wien 6.11.2003 1 PAIRED COMPARISONS (PC) a

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Analysis of Multinomial Response Data: a Measure for Evaluating Knowledge Structures

Analysis of Multinomial Response Data: a Measure for Evaluating Knowledge Structures Analysis of Multinomial Response Data: a Measure for Evaluating Knowledge Structures Department of Psychology University of Graz Universitätsplatz 2/III A-8010 Graz, Austria (e-mail: ali.uenlue@uni-graz.at)

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Collaborative topic models: motivations cont

Collaborative topic models: motivations cont Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Mathematics for large scale tensor computations

Mathematics for large scale tensor computations Mathematics for large scale tensor computations José M. Martín-García Institut d Astrophysique de Paris, Laboratoire Universe et Théories Meudon, October 2, 2009 What we consider Tensors in a single vector

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Optimal Blocking by Minimizing the Maximum Within-Block Distance

Optimal Blocking by Minimizing the Maximum Within-Block Distance Optimal Blocking by Minimizing the Maximum Within-Block Distance Michael J. Higgins Jasjeet Sekhon Princeton University University of California at Berkeley November 14, 2013 For the Kansas State University

More information

EM (cont.) November 26 th, Carlos Guestrin 1

EM (cont.) November 26 th, Carlos Guestrin 1 EM (cont.) Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 26 th, 2007 1 Silly Example Let events be grades in a class w 1 = Gets an A P(A) = ½ w 2 = Gets a B P(B) = µ

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016 AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework

More information

Different points of view for selecting a latent structure model

Different points of view for selecting a latent structure model Different points of view for selecting a latent structure model Gilles Celeux Inria Saclay-Île-de-France, Université Paris-Sud Latent structure models: two different point of views Density estimation LSM

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification

10-810: Advanced Algorithms and Models for Computational Biology. Optimal leaf ordering and classification 10-810: Advanced Algorithms and Models for Computational Biology Optimal leaf ordering and classification Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Variable selection for model-based clustering

Variable selection for model-based clustering Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition

More information

Comparison between conditional and marginal maximum likelihood for a class of item response models

Comparison between conditional and marginal maximum likelihood for a class of item response models (1/24) Comparison between conditional and marginal maximum likelihood for a class of item response models Francesco Bartolucci, University of Perugia (IT) Silvia Bacci, University of Perugia (IT) Claudia

More information

Model Complexity of Pseudo-independent Models

Model Complexity of Pseudo-independent Models Model Complexity of Pseudo-independent Models Jae-Hyuck Lee and Yang Xiang Department of Computing and Information Science University of Guelph, Guelph, Canada {jaehyuck, yxiang}@cis.uoguelph,ca Abstract

More information

Statistical Analysis of List Experiments

Statistical Analysis of List Experiments Statistical Analysis of List Experiments Graeme Blair Kosuke Imai Princeton University December 17, 2010 Blair and Imai (Princeton) List Experiments Political Methodology Seminar 1 / 32 Motivation Surveys

More information

Statistical Practice. Selecting the Best Linear Mixed Model Under REML. Matthew J. GURKA

Statistical Practice. Selecting the Best Linear Mixed Model Under REML. Matthew J. GURKA Matthew J. GURKA Statistical Practice Selecting the Best Linear Mixed Model Under REML Restricted maximum likelihood (REML) estimation of the parameters of the mixed model has become commonplace, even

More information

arxiv: v1 [math.ra] 11 Aug 2010

arxiv: v1 [math.ra] 11 Aug 2010 ON THE DEFINITION OF QUASI-JORDAN ALGEBRA arxiv:1008.2009v1 [math.ra] 11 Aug 2010 MURRAY R. BREMNER Abstract. Velásquez and Felipe recently introduced quasi-jordan algebras based on the product a b = 1

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

A class of latent marginal models for capture-recapture data with continuous covariates

A class of latent marginal models for capture-recapture data with continuous covariates A class of latent marginal models for capture-recapture data with continuous covariates F Bartolucci A Forcina Università di Urbino Università di Perugia FrancescoBartolucci@uniurbit forcina@statunipgit

More information

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697 705 Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Kwang Mo Jeong a, Hyun Yung Lee 1, a a Department

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

A Bayesian Criterion for Clustering Stability

A Bayesian Criterion for Clustering Stability A Bayesian Criterion for Clustering Stability B. Clarke 1 1 Dept of Medicine, CCS, DEPH University of Miami Joint with H. Koepke, Stat. Dept., U Washington 26 June 2012 ISBA Kyoto Outline 1 Assessing Stability

More information

Combining Predictions in Pairwise Classification: An Optimal Adaptive Voting Strategy and Its Relation to Weighted Voting

Combining Predictions in Pairwise Classification: An Optimal Adaptive Voting Strategy and Its Relation to Weighted Voting Combining Predictions in Pairwise Classification: An Optimal Adaptive Voting Strategy and Its Relation to Weighted Voting Draft of a paper to appear in Pattern Recognition Eyke Hüllermeier a,, Stijn Vanderlooy

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information