Clustering using Unsupervised Binary Trees: CUBT


arXiv: v1 [stat.ME] 11 Nov 2010

Ricardo Fraiman, Universidad de San Andrés
Badih Ghattas, Université de la Méditerranée
Marcela Svarc, Universidad de San Andrés

November 12, 2010

Abstract

We introduce a new clustering method based on unsupervised binary trees. It is a three-stage procedure: the first stage performs recursive binary splits that reduce the heterogeneity of the data within the new subsamples; in the second stage (pruning), adjacent nodes are considered for aggregation; finally, in the third stage (joining), similar clusters are joined even if they do not descend from the same node. Consistency results are obtained and the procedure is tested on simulated and real data sets.

1 Introduction

Clustering is a method of unsupervised classification and a common technique for statistical data analysis, with applications in many fields such as medicine, marketing and economics, among others. The term cluster analysis (first used by Tryon, 1939) covers a number of different algorithms and methods for grouping similar data into categories. The grouping is built up in such a way that the degree of association between data is maximal if they belong to the same group and minimal otherwise.

Cluster analysis, or clustering, is the assignment of a set of observations from R^p into subsets (called clusters) such that observations in the same cluster are similar in some sense.

These definitions are quite vague, since there is no clear population objective function to measure the performance of a clustering procedure: each clustering algorithm implicitly has an objective function which varies from one method to another. It is important to notice that, even though most clustering procedures require the number of clusters beforehand, in practice this information is usually unknown. On the contrary, in supervised classification the number of groups is known and we have, in addition, a learning sample and a universal objective function: to minimize the number of misclassifications or, in population terms, to minimize the Bayes error. Despite these facts there are many similarities between supervised and unsupervised classification. Specifically, many algorithms share the same spirit for both problems.

Supervised and unsupervised classification algorithms (Hastie, 2003) have two main branches: algorithms can be partitional or hierarchical. Partitional algorithms determine all groups at once. The most popular and studied partitioning procedure for cluster analysis is k-means. Hierarchical algorithms find successive groups by splitting or joining previously established groups. These algorithms can be either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate group and merge them into successively larger groups. Divisive algorithms begin with the whole set and proceed to split it into successively smaller groups. Hierarchical algorithms create a hierarchy of partitions which may be represented in a tree structure.

The best known hierarchical algorithm for supervised classification is CART (Breiman et al., 1984). CART has a relevant additional property: the partition tree is built up from a few binary conditions on the original coordinate variables of the data. In most cases the interpretation of the results is summarized in a tree with a very simple structure. Such a classification scheme is valuable not only for the rapid classification of new observations; it can often yield a much simpler model for explaining why the observations are classified in a particular group, a fact that is remarkably important in many applications. Moreover, it is important to highlight that the algorithm does not assume any kind of parametric model for the underlying distribution.

Several methods to obtain clusters based on decision trees have already been introduced. Liu et al. (2000) use decision trees to partition the data space into clusters and empty (sparse) regions at different levels of detail. The method is based on the idea of adding an artificial sample of size N uniformly distributed on the space. With these N points added to the original data set, the problem is to obtain a partition of the space into dense and sparse regions. They treat it as a classification problem using a new purity function adapted to the problem, based on the relative density among regions.

Chavent et al. (1999) obtain a binary clustering tree based on a particular variable and its binary transformation. They present two different procedures. In the first one the splitting variables are recursively selected using correspondence analysis, and the factorial scores lead to the binary transformation. In the second one the candidate variables and their transformations are simultaneously selected by an optimization criterion which evaluates the resulting partitions. Basak et al. (2005) propose four different measures for selecting the most appropriate features for splitting the data at every node, and two algorithms for partitioning the data at every decision node. Specifically for categorical data, Andreopoulos et al. (2007) introduce HIERDENC, an algorithm that searches for dense subspaces on the cube distribution of the attribute values of the data set.

Our objective is to propose a simple clustering procedure sharing the appealing properties of CART. We introduce a hierarchical top-down method called CUBT (Clustering using Unsupervised Binary Trees), where the clustering tree is based on binary rules on the original variables, which helps to understand the clustering structure. One main difference with the most popular hierarchical algorithms, such as single linkage or complete linkage, is that our clustering rule has a predictive property, since it allows the classification of a new observation. The procedure has three stages. The first one grows a maximal tree by applying a recursive partitioning algorithm. The second one prunes the tree using a minimal dissimilarity criterion. Finally, the third one aggregates leaves of the tree which do not necessarily share the same direct ascendant.

The paper is organized as follows. In Section 2 we introduce some notation and describe the empirical and population versions of our method. The latter describes the method in terms of the population, regarded as a random vector X in R^p with unknown distribution P. The consistency of our method is shown in Section 3. In Section 4 we present the results of a simulation study where we challenge our method on several different models and compare it with k-means. We also compare, on a synthetic data set, the tree structure provided by CART (using the training sample) and by CUBT considering the same sample without the labels. A real data example is analyzed in Section 5. Concluding remarks are given in Section 6. All proofs are given in the Appendix.

2 Clustering à la CART

We start by fixing some notation. Let X be a random p-dimensional real vector in R^p with coordinates X(j), j = 1, ..., p, such that E(‖X‖²) < ∞. The data consist of n independent and identically distributed realizations of X, Ξ = {X_1, ..., X_n}. For the population version the space is R^p, while for the empirical version the space is Ξ. We denote by t the nodes of the tree. Each node t of the tree determines a subset of R^p which will also be denoted t ⊂ R^p. To the root node we assign the whole space.

Even though our procedure shares in many aspects the spirit of CART, two main differences should be pointed out. First, as we are in the case of unsupervised classification, only the information of the observations without labels is available; thus the splitting criterion cannot be based on the labels as in CART. The other essential difference is that, instead of having one final pruning stage, our algorithm has two phases: first we prune the tree and then there is a final joining stage. The former procedure evaluates the merging of adjacent nodes, while the latter aims to aggregate similar clusters that do not share the same direct ascendant in the tree.

2.1 Forward step: maximal tree construction

As we are defining a top-down procedure, we begin by assigning the whole space to the root node. Let t be a node and t̂ = Ξ ∩ t the set of observations of the sample at hand that belong to it. At each stage a terminal node is considered to be split into two subnodes, the left and right children t_l and t_r, if it fulfills a condition. At the beginning there is only one node, the root, which contains the whole space. The splitting rule has the form x(j) ≤ a, where x(j) is a variable and a is a threshold level. Thus, t_l = {x ∈ R^p : x(j) ≤ a} and t_r = {x ∈ R^p : x(j) > a}.

Let X_t be the restriction of X to the node t, i.e. X_t = X | {X ∈ t}, and let α_t be the probability of being in t, α_t = P(X ∈ t). Then R(t) is a heterogeneity measure of t defined by

$$ R(t) = \alpha_t \, \mathrm{trace}(\mathrm{Cov}(X_t)), \qquad (1) $$

where Cov(X_t) is the covariance matrix of X_t. Thus, R(t) roughly measures the mass concentration of the random vector X on the set t, weighted by the mass of the set t.

In the empirical algorithm, α_t and Cov(X_t) are replaced by their empirical versions (estimates), and R(t) is then called the deviance. Denote by n_t the cardinality of the set t, n_t = ∑_{i=1}^n I_{{X_i ∈ t}} (where I_A stands for the indicator function of the set A). The estimated probability is α̂_t = n_t / n, and the estimate of E(‖X_t − µ_t‖²) is

$$ \frac{1}{n_t} \sum_{\{X_i \in t\}} \| X_i - \bar{X}_t \|^2, \qquad (2) $$

where X̄_t is the empirical mean of the observations in t.

The best split for t is defined as the couple (j, a) ∈ {1, ..., p} × R (the first element indicates the variable on which the partition is defined and the second element is the threshold level) that maximizes

$$ \Delta(t, j, a) = R(t) - R(t_l) - R(t_r). $$

It is easy to verify that Δ(t, j, a) ≥ 0 for every t, j, a; this property is also satisfied by all the splitting criteria proposed in CART. We start with the whole space assigned to the root node, and each node is then split recursively until one of the following stopping rules is satisfied:

- All the observations within the node are the same.
- There are fewer than minsize observations in the node.
- The reduction of the deviance is less than mindev · R(S), where S is the whole space.

minsize and mindev are tuning parameters that must be supplied by the user; we consider minsize = 5. Once the algorithm stops, a label is assigned to each leaf (terminal node); we call this tree the maximal tree. We consider the standard criterion for enumerating the nodes of a complete binary tree: if t is the node with number m, its left child is numbered 2m and its right child 2m + 1.

At this point we have obtained a partition of the space and consequently a partition of the data set, where each leaf is associated with a cluster. Ideally this tree has at least the same number of clusters as the population; in practice it may have too many clusters, and then an agglomerative stage must be applied, as in CART. It is important to remark that, even if the number of clusters is known beforehand, it is not needed at this stage, and that small values of mindev ensure a tree with many leaves. Moreover, if the tree has the same number of leaves as the number of clusters, then it is not necessary to run the subsequent stages of the algorithm.
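To make the forward step concrete, the following Python sketch (our own illustration, not the authors' implementation; the names deviance and best_split and the exhaustive midpoint search over thresholds are assumptions of ours) computes the empirical deviance of a node and the split (j, a) that maximizes the reduction Δ̂(t, j, a) = R̂(t) − R̂(t_l) − R̂(t_r).

```python
import numpy as np

def deviance(X_node, n_total):
    """Empirical deviance of a node: alpha_t * trace(Cov(X_t)),
    with alpha_t = n_t / n_total and the (biased) empirical covariance."""
    n_t = X_node.shape[0]
    if n_t == 0:
        return 0.0
    centered = X_node - X_node.mean(axis=0)
    # trace of the empirical covariance = mean squared distance to the node mean
    return (n_t / n_total) * np.mean(np.sum(centered ** 2, axis=1))

def best_split(X_node, n_total):
    """Exhaustive search of the split x(j) <= a maximizing the deviance reduction."""
    r_parent = deviance(X_node, n_total)
    best = (None, None, 0.0)                      # (variable j, threshold a, reduction)
    for j in range(X_node.shape[1]):
        values = np.unique(X_node[:, j])
        for a in (values[:-1] + values[1:]) / 2:  # midpoints as candidate thresholds
            left = X_node[X_node[:, j] <= a]
            right = X_node[X_node[:, j] > a]
            gain = r_parent - deviance(left, n_total) - deviance(right, n_total)
            if gain > best[2]:
                best = (j, a, gain)
    return best
```

In a recursive construction, a node would be split with the returned (j, a) only if it contains at least minsize observations and the reduction exceeds mindev · R̂(S).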

Figure 1: Illustration of the maximal tree for Model M1 simulated data (tree grown with minsize = 10, minsplit = 25, mindev = 0.01; all splits are of the form x1 < a or x2 < a).

An example of a maximal tree is given in Figure 1 for a simulated data set from model M1, defined in Section 4. The maximal tree has fourteen leaves while model M1 has a three-group structure; the number of observations belonging to each terminal node is indicated in the figure.

2.2 Backward step: pruning and joining

In this step we successively use two algorithms to give the final grouping: the first one prunes the tree and the second one merges non-adjacent leaves; we call the latter joining. We introduce a pruning criterion named minimum dissimilarity pruning.

2.2.1 Minimum dissimilarity pruning

At this stage we define a measure of dissimilarity between sibling nodes and collapse them if this measure is lower than a threshold. First, we consider the maximal tree T_0 obtained in the previous stage.

Let t_l and t_r be a pair of terminal nodes sharing the same direct ascendant. Next, define (in population terms) the random variable W_lr = D(X_{t_l}, X_{t_r}) as the Euclidean distance between the random elements of X_{t_l} and of X_{t_r}. Finally, define the dissimilarity measure between the sets t_l and t_r as

$$ \Delta_{lr} = \int_0^{\delta} q_{\alpha}(W_{lr}) \, d\alpha, $$

where q_α(·) stands for the quantile function and δ ∈ (0, 1) is a proportion. If Δ_lr < ε, we prune the tree, i.e. we replace t_l and t_r by t_l ∪ t_r in the partition. It is worth noting that Δ_lr is just a more resistant version of the distance between the supports of the random vectors X_{t_l} and X_{t_r}.

The dissimilarity measure can be estimated as follows. Let n_l (resp. n_r) be the size of t̂_l (resp. t̂_r). Consider, for every x_i ∈ t̂_l and y_j ∈ t̂_r, the sequences

$$ d_l(i) = \min_{y \in \hat{t}_r} d(x_i, y), \qquad d_r(j) = \min_{x \in \hat{t}_l} d(x, y_j), $$

and their ordered versions, denoted d_{l(i)} and d_{r(j)}. For δ ∈ [0, 1], let

$$ \bar{d}_l^{\delta} = \frac{1}{\lceil \delta n_l \rceil} \sum_{i \le \delta n_l} d_{l(i)}, \qquad \bar{d}_r^{\delta} = \frac{1}{\lceil \delta n_r \rceil} \sum_{j \le \delta n_r} d_{r(j)}. $$

We compute the dissimilarity between t_l and t_r as

$$ d_{\delta}(l, r) = d_{\delta}(t_l, t_r) = \max(\bar{d}_l^{\delta}, \bar{d}_r^{\delta}), $$

and at each step of the algorithm the leaves t_l and t_r are merged into the ascendant node t if d_δ(l, r) ≤ ε, where ε > 0. The dissimilarity pruning therefore depends on two parameters, δ and ε; from now on we call ε mindist.
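As an illustration of this empirical dissimilarity (a sketch under our own naming conventions and reading of the formulas above; δ and ε are user-supplied parameters), d_δ(l, r) between two sibling leaves could be computed as follows.

```python
import numpy as np

def dissimilarity(Xl, Xr, delta=0.2):
    """Empirical dissimilarity between two sibling leaves: for each point of one
    leaf take the distance to the closest point of the other leaf, average the
    lowest delta-fraction of these ordered distances, and return the maximum of
    the two averages."""
    # pairwise Euclidean distances between the observations of the two leaves
    dists = np.linalg.norm(Xl[:, None, :] - Xr[None, :, :], axis=2)
    d_l = np.sort(dists.min(axis=1))   # nearest-neighbour distances from t_l to t_r
    d_r = np.sort(dists.min(axis=0))   # nearest-neighbour distances from t_r to t_l
    k_l = max(1, int(np.ceil(delta * len(d_l))))
    k_r = max(1, int(np.ceil(delta * len(d_r))))
    return max(d_l[:k_l].mean(), d_r[:k_r].mean())

# Sibling leaves would be collapsed whenever
# dissimilarity(Xl, Xr, delta) <= epsilon   (epsilon playing the role of mindist).
```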

2.2.2 Joining

The idea of the joining step is to aggregate nodes which do not necessarily share the same direct ascendant. It is based on the relative decrease of the deviance when two nodes are aggregated. For any tree T we define its deviance as

$$ R(T) = \frac{1}{n} \sum_{t \in \tilde{T}} R(t), $$

where T̃ is the set of leaves of T. Our goal is to find an optimal subtree of T. All pairs of terminal nodes t_i and t_j (not necessarily sharing the same direct ascendant) are compared by computing

$$ d_{ij} = \frac{R(t_i \cup t_j) - R(t_i) - R(t_j)}{R(t_i \cup t_j)}. $$

For the empirical version we consider a plug-in estimate of d_ij, following the definitions given in Section 2.1. As in standard hierarchical clustering procedures, pairs of terminal nodes are successively aggregated, at each step merging the pair with the smallest d_ij (d̂_ij respectively), so that each step reduces the number of clusters by one. We may consider two different stopping rules for the joining procedure, corresponding to the case where the number of clusters k is known and to the case where k is unknown.

- If k is known, repeat the following step until m = k: for each pair of values (i, j), 1 ≤ i < j ≤ m, let (ĩ, j̃) = argmin_{i,j} {d_ij}; replace t_ĩ and t_j̃ by their union t_ĩ ∪ t_j̃, set m = m − 1 and proceed.
- If k is unknown, replace t_ĩ and t_j̃ by their union t_ĩ ∪ t_j̃ whenever d_ĩj̃ < η, where η > 0 is a given constant, and continue until this condition is no longer fulfilled.

In the first case the stopping criterion is the number of clusters, while in the second case a threshold η for d_ij must be set. A computational sketch of this aggregation rule is given below.
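The sketch assumes k is known and uses hypothetical helper names of our own (deviance is the node deviance of Section 2.1, repeated here so the block is self-contained); it is an illustration, not the authors' code.

```python
import numpy as np
from itertools import combinations

def deviance(X_node, n_total):
    """Empirical deviance of a node (see Section 2.1)."""
    c = X_node - X_node.mean(axis=0)
    return (X_node.shape[0] / n_total) * np.mean(np.sum(c ** 2, axis=1))

def joining_criterion(Xi, Xj, n_total):
    """Relative increase of the deviance when two leaves are merged."""
    r_union = deviance(np.vstack([Xi, Xj]), n_total)
    return (r_union - deviance(Xi, n_total) - deviance(Xj, n_total)) / r_union

def join_until_k(leaves, n_total, k):
    """Greedily merge the pair of leaves with the smallest criterion until k clusters remain.
    `leaves` is a list of arrays, one array of observations per terminal node."""
    leaves = list(leaves)
    while len(leaves) > k:
        i, j = min(combinations(range(len(leaves)), 2),
                   key=lambda p: joining_criterion(leaves[p[0]], leaves[p[1]], n_total))
        merged = np.vstack([leaves[i], leaves[j]])
        leaves = [x for m, x in enumerate(leaves) if m not in (i, j)] + [merged]
    return leaves
```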

2.3 CUBT and k-means

In this section we discuss, somewhat informally, when our procedure and the well known k-means algorithm should produce a reasonably good output. We consider those cases where there are nice groups strictly separated. More precisely, let A_1, ..., A_k be disjoint connected compact sets in R^p such that each A_i is the closure of its interior, i = 1, ..., k, and let {P_i : i = 1, ..., k} be probability measures on R^p with supports {A_i : i = 1, ..., k}. A typical case is obtained by defining a random vector X with density f and then considering X conditioned on the level set {f > δ} for a positive level δ, as in several hierarchical clustering procedures.

On the one hand, an admissible family for CUBT is a family of sets A_1, ..., A_k for which there exists another family of disjoint sets B_1, ..., B_k, each built up as the intersection of a finite number of half-spaces delimited by hyperplanes orthogonal to the coordinate axes, satisfying A_i ⊂ B_i.

On the other hand, k-means is defined through the vector of centers (c_1, ..., c_k) minimizing

$$ E\left( \min_{j = 1, \ldots, k} \| X - c_j \| \right). $$

Associated with each center c_j is the convex polyhedron S_j of all points in R^p closer to c_j than to any other center, called the Voronoi cell of c_j. The sets of the partition S_1, ..., S_k are the population clusters for k-means. Therefore, the population clusters for k-means are defined by exactly k hyperplanes in arbitrary positions. Then, an admissible family for k-means is a family of sets A_1, ..., A_k that can be separated by exactly k hyperplanes. Even though the hyperplanes for k-means can be in general position, one cannot use more than k of them. It is clear that in this sense CUBT is much more flexible than k-means, since its family of admissible sets is more general. For instance, k-means will necessarily fail to identify nested groups, while this is not the case for CUBT.

Another important difference between k-means and CUBT is that our proposal is less sensitive to small changes in the parameters that define the partition: small changes in them produce small changes in the partition. In contrast, small changes in the centers (c_1, ..., c_k) defining the k-means partition can produce significant changes in the associated partition given by the Voronoi cells.

3 Consistency of CUBT

In this section we give some theoretical results on the consistency of our algorithm. First we prove an important property, the monotonicity of the deviance as the tree size increases. A simple equivalent characterization of the function R(t) is given in the following lemma.

Lemma 3.1. Let t_l and t_r be disjoint compact sets in R^p and denote µ_s = E(X_{t_s}), s = l, r. If t = t_l ∪ t_r we have that

$$ R(t) = R(t_l) + R(t_r) + \frac{\alpha_{t_l} \alpha_{t_r}}{\alpha_t} \| \mu_l - \mu_r \|^2. \qquad (3) $$

The proof is given in the Appendix.
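As a quick numerical sanity check (not part of the paper's argument), the decomposition (3) can be verified with the empirical plug-in versions of R, α and µ on an arbitrary split of a simulated sample; the identity then holds exactly for the empirical measure, up to floating-point error. The sketch below is self-contained and the data and split are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))        # sample playing the role of X restricted to t
left = X[:, 0] <= 0.3                 # an arbitrary binary split of t into t_l, t_r
n = X.shape[0]

def R_hat(mask):
    """Empirical version of R(t) = alpha_t * trace(Cov(X_t))."""
    Z = X[mask]
    c = Z - Z.mean(axis=0)
    return (mask.sum() / n) * np.mean(np.sum(c ** 2, axis=1))

a_l, a_r = left.mean(), (~left).mean()              # empirical alpha_{t_l}, alpha_{t_r}
mu_l, mu_r = X[left].mean(axis=0), X[~left].mean(axis=0)

lhs = R_hat(np.ones(n, dtype=bool))                 # R(t) with t the whole sample (alpha_t = 1)
rhs = R_hat(left) + R_hat(~left) + a_l * a_r * np.sum((mu_l - mu_r) ** 2)
print(lhs, rhs)                                     # the two values agree up to numerical error
```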

Remark 3.1 (Monotonicity of the function R(·) and geometric interpretation). Observe that Lemma 3.1 entails that, for all disjoint compact sets t_l, t_r and t = t_l ∪ t_r, the function R(·) is monotonic in the sense that

$$ R(t) \ge R(t_l) + R(t_r). \qquad (4) $$

Moreover, R(t) will be close to R(t_l) + R(t_r) when the last term on the right hand side of (3) is small. This happens either if one of the sets t_l, t_r has a very small fraction of the mass of t, and/or if the centers of the two subsets t_l, t_r are very close. In either case we will not want to split the set t.

The following results show the consistency of the empirical algorithm towards its population version. We begin with the splitting algorithm and then follow with the pruning and joining.

Theorem 3.1. Assume that the random vector X has distribution P and a density f such that ‖x‖² f(x) is bounded. Let X_1, ..., X_n be iid random vectors with the same distribution as X and denote by P_n the empirical distribution of the sample X_1, ..., X_n. Let {t_{1n}, ..., t_{m_n n}} be the empirical binary partition obtained by the forward empirical algorithm, and {t_1, ..., t_m} its population version. Then m_n = m ultimately, and each pair (i_{jn}, a_{jn}) ∈ {1, ..., p} × R determining the empirical partition converges a.s. to the corresponding pair (i_j, a_j) ∈ {1, ..., p} × R of the population version. In particular, this implies that

$$ \lim_{n \to \infty} \sum_{i=1}^{m} P(t_{in} \,\triangle\, t_i) = 0, $$

where △ stands for the symmetric difference. The proof is given in the Appendix.

Theorem 3.2. Let {t_{1n}, ..., t_{k_n n}} be the final empirical binary partition obtained after the forward and backward empirical algorithms, and {t_1, ..., t_k} its population version. Under the assumptions of Theorem 3.1 we have that k_n = k ultimately (k_n = k for all n if k is known), and

$$ \lim_{n \to \infty} \sum_{i=1}^{k} P(t_{in} \,\triangle\, t_i) = 0. $$

The proof is given in the Appendix.

4 Some experiments

In this section we present the results of a simulation study where we challenge our method on four different models, and we compare these results with those of k-means. As is well known, the performance of k-means strongly depends on the position of the initial centroids used to start the algorithm. Several proposals have been made to handle this effect (see Steinley, 2007). We follow the recommendations in this last reference, considering ten random initializations and keeping the one with minimum within-cluster sum of squares, given by

$$ \sum_{i=1}^{n} \sum_{j=1}^{k} \| X_i - c_j \|^2 \, I_{\{X_i \in G_j\}}, $$

where G_j is the j-th group and c_j is the corresponding center. We denote this version k-means(10).

4.1 Simulated data sets

We consider four different models, with k = 3, 4 and 10 groups. The sample size of each group is n_i = 100, i = 1, ..., k, so that the total sample size is N = 100k, except in the last model, where n_i = 30 for i = 1, ..., 10 and therefore N = 300.

M1. Three groups in dimension 2. The data are generated according to the distributions

$$ N(\mu_1, \Sigma), \quad N(\mu_2, \Sigma), \quad N(\mu_3, \Sigma), \qquad (5) $$

with µ_1 = (−4, 0), µ_2 = (0, 0), µ_3 = (5, 5) and a common covariance matrix Σ. The left panel of Figure 2 shows an example of data generated from model M1.

M2. Four groups in dimension 2. The data are generated from N(µ_i, Σ), i = 1, ..., 4, with centers (−1, 0), (1, 0), (0, 1), (0, −1) and covariance matrix Σ = σ² Id, σ = 0.11, 0.13, .... The right panel of Figure 2 shows an example of data generated from model M2 with σ = 0.15.

Figure 2: Scatter plots corresponding to M1 (left) and M2 with σ = 0.15 (right).

M3. Three groups in dimension 10. The data are generated according to the distributions given in (5) with Σ = σ² Id, σ = 0.7, 0.75, ..., 0.9, µ_1 = (−2, ..., −2), µ_2 = (0, ..., 0), µ_3 = (3, ..., 3). The left panel of Figure 3 shows an example of data generated from model M3 with σ = 0.8, projected on the first two coordinates.

M4. Ten groups in dimension 5. The data are generated from N(µ_i, Σ), i = 1, ..., 10. The means of the first five groups, µ_i, are the vectors of the canonical basis e_1, ..., e_5 respectively, while the centers of the five remaining groups are µ_{5+i} = −e_i, i = 1, ..., 5. In every case the covariance matrix is Σ = σ² Id, σ = 0.11, 0.13, .... The right panel of Figure 3 shows an example of data generated from model M4 with σ = 0.19, projected on the first two coordinates.
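To give a concrete flavour of these experiments, the sketch below simulates model M2 (the only model above whose parameters are fully reproduced here) and applies the k-means(10) rule described earlier, i.e. plain Lloyd iterations restarted ten times, keeping the solution with the smallest within-cluster sum of squares. The function names and the simple Lloyd implementation are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_m2(sigma=0.15, n_per_group=100):
    """Model M2: four spherical Gaussian groups in dimension 2."""
    centers = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    X = np.vstack([c + sigma * rng.normal(size=(n_per_group, 2)) for c in centers])
    y = np.repeat(np.arange(4), n_per_group)
    return X, y

def kmeans10(X, k, n_init=10, n_iter=100):
    """Lloyd iterations restarted n_init times; keep the labelling with the
    smallest within-cluster sum of squares (the k-means(10) rule of the text)."""
    best_labels, best_wss = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(n_iter):
            labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        wss = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
        if wss < best_wss:
            best_labels, best_wss = labels, wss
    return best_labels

X, y = simulate_m2()
labels = kmeans10(X, k=4)
```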

Figure 3: Two-dimensional projection scatter plots corresponding to M3 with σ = 0.8 (left) and M4 with σ = 0.19 (right).

4.2 Tuning the method

We perform M = 100 replicates for each model and compare the results with those of k-means and k-means(10). Throughout the simulations k is assumed to be known. In order to perform CUBT we must fix the values of the parameters involved at each stage of the algorithm:

- For the maximal tree we use minsize = 5.
- For the pruning stage: mindist = 0.3 and 0.5, with a fixed δ.
- For the joining stage: the value of k of each model, which has been stated previously.

Since we are working with synthetic data sets, we know the actual label of each observation, so it is reasonable to measure the goodness of a partition by computing the number of misclassified observations, the analogue of the misclassification error of supervised classification procedures. Denote the original clusters by r = 1, ..., R and the predicted ones by s = 1, ..., S. Let y_1, ..., y_n be the group labels of the observations and ŷ_1, ..., ŷ_n the class labels assigned by the clustering algorithm. Let Σ be the set of permutations of {1, ..., S} and A the set of arrangements of S elements taken from {1, ..., R}. The misclassification error we use may then be expressed as

$$ MCE = \begin{cases} \min_{\sigma \in \Sigma} \frac{1}{n} \sum_{i=1}^{n} 1_{\{y_i \neq \sigma(\hat{y}_i)\}} & \text{if } S \geq R, \\[4pt] \min_{\sigma \in A} \frac{1}{n} \sum_{i=1}^{n} 1_{\{\sigma(y_i) \neq \hat{y}_i\}} & \text{otherwise.} \end{cases} \qquad (6) $$

If the number of clusters is large, the assignment problem may be computed in polynomial time using bipartite matching and the Hungarian method (Papadimitriou et al., 1982). It is important to remark that Equation (6) is given in a more general framework than we need, because in our case the number of clusters is known and S is equal to R; nonetheless Equation (6) admits S different from R.
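When S = R the minimization in (6) reduces to matching predicted to true clusters, which can be solved with the Hungarian algorithm mentioned above. The sketch below (the function name mce is ours) uses scipy.optimize.linear_sum_assignment on the contingency table between true and predicted labels.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mce(y_true, y_pred):
    """Misclassification error of a clustering: proportion of observations left
    unmatched under the best one-to-one relabelling of the predicted clusters."""
    true_labels = np.unique(y_true)
    pred_labels = np.unique(y_pred)
    # matches[r, s] = number of points with true label r and predicted label s
    matches = np.array([[np.sum((y_true == r) & (y_pred == s)) for s in pred_labels]
                        for r in true_labels])
    rows, cols = linear_sum_assignment(-matches)   # maximize the number of agreements
    return 1.0 - matches[rows, cols].sum() / len(y_true)
```

Applied, for instance, to the k-means(10) labels of the previous sketch, mce(y, labels) gives the kind of misclassification error reported in Table 1.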

Table 1: Simulation results for models M1 to M4: misclassification errors of CUBT (0.3), CUBT (0.5), k-means and k-means(10) for each value of Sigma (σ).

4.3 Results

Table 1 shows the results for the four models. Except for model M1, we varied the value of σ, reflecting the level of overlap between classes. We report the misclassification error obtained for each clustering procedure: CUBT with mindist = 0.3 (CUBT (0.3)) and mindist = 0.5 (CUBT (0.5)), k-means and k-means(10). The results for the different values of mindist are practically the same, and in almost every case the results for CUBT lie between those of k-means and k-means(10). Moreover, it is important to notice that the average number of leaves of the maximal tree is much bigger than the actual number of clusters: for model M1 it equals 33.8, for M2 it ranges between 35.2 and 40.4, for M3 between 18.4 and 20.8, and for M4 it is at least 45.2.

4.4 A comparison between CART and CUBT

We compare in a simple example the tree structure obtained by CART using the complete training sample (observations plus group labels) with the tree structure obtained by CUBT considering only the training sample without the labels. We generate three subpopulations in a two-dimensional variable space. The underlying distribution of the vector X = (X_1, X_2) is bivariate normal, where the variables X_1 and X_2 are independent and their distributions for the three groups are given by

X_1 ~ N(0, 0.03), N(2, 0.03), N(1, 0.25),
X_2 ~ N(0, 0.25), N(1, 0.25), N(2.5, 0.03),

respectively. The data are then rotated by π/4. A difficulty of this problem is that the optimal partitions are not parallel to the axes. Figure 4 shows, for a single sample of size 300, the partitions obtained by CART and by CUBT.

We performed 100 replicates; for each one a training sample of size 300 is generated, where every group has the same size. We then compute the mean misclassification rate with respect to the true labels. For CUBT its value was 0.032, while for CART there are no classification errors, since we use the same sample for both purposes, growing the tree and classifying the observations. Moreover, in order to compare our procedure with the traditional CART classification method, we computed the respective binary trees for CART and CUBT. Both trees are presented in Figure 5; the structure is exactly the same for both of them, and the different cutoff values in the different branches may be understood with the aid of Figure 4, which corresponds to the same data set.
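A short sketch of the data-generating mechanism used in this comparison (our own code; we interpret the second argument of N(·, ·) above as a variance, which is an assumption on our part):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_rotated(n_per_group=100):
    """Three bivariate normal groups with independent coordinates, rotated by pi/4."""
    params = [((0.0, 0.0), (0.03, 0.25)),   # (means of X1, X2), (variances of X1, X2)
              ((2.0, 1.0), (0.03, 0.25)),
              ((1.0, 2.5), (0.25, 0.03))]
    X = np.vstack([np.asarray(m) + np.sqrt(np.asarray(v)) * rng.normal(size=(n_per_group, 2))
                   for m, v in params])
    theta = np.pi / 4
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    y = np.repeat(np.arange(3), n_per_group)
    return X @ rot.T, y
```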

5 A real data example: European Jobs

The data set contains the percentage employed in different economic activities for several European countries. The categories are agriculture (A), mining (M), manufacturing (MA), power supply industries (P), construction (C), service industries (SI), finance (F), social and personal services (S), and transportation and communication (T). It is important to notice that these data were collected during the cold war. The aim is to allocate the observations into clusters, but since the number of clusters is unknown we study the data structure for several numbers of clusters.

Figure 4: Plot of the partitions of the space for a generated data set. The solid lines show the partition for CUBT and the dashed lines the partition for CART.

We consider first a four-group structure. In this case only one variable, the percentage employed in agriculture, determines the tree structure. The four groups are given in Table 2 and the corresponding tree is plotted in the top panel of Figure 6. As can be seen in the tree, the highest value of A corresponds to Turkey, which is an outlier and forms a single-observation cluster; this can be explained by its social and territorial proximity to Africa. The countries forming Groups 2 and 3 were either under a communist government or going through difficult political situations; for instance, Spain was leaving behind Franco's regime. The countries belonging to Group 2 were poorer than those in Group 3. Finally, Group 4 has the lowest percentage of employment in agriculture, and the countries in that group were the most developed ones and not controlled by a communist regime, with the exception of East Germany. If a four-group structure is obtained by k-means, Turkey and Ireland are each isolated in a single group, Greece, Portugal, Poland, Romania and Yugoslavia form another group, and the rest of the countries form the remaining cluster.

Figure 5: Left: tree corresponding to CUBT. Right: tree corresponding to CART. In both cases the left branches indicate the smaller values of the partitioning variable.

Table 2: CUBT clustering structure for four groups.
  Group 1: Turkey
  Group 2: Greece, Poland, Romania, Yugoslavia
  Group 3: Ireland, Portugal, Spain, Bulgaria, Czechoslovakia, Hungary, USSR
  Group 4: Belgium, Denmark, France, W. Germany, E. Germany, Italy, Luxembourg, Netherlands, United Kingdom, Austria, Norway, Sweden, Switzerland

If, instead of four clusters, we consider a five-cluster structure, the main difference is that Group 4 of the previous partition is divided into three groups (Groups 3, 4 and 5), and the variables explaining those partitions are the percentage employed in mining and the percentage employed in agriculture. The third group of the previous partition remains stable, and Groups 1 and 2 of the previous partition collapse into one group, which we denote Group 1.

Figure 6: Tree structure considering four groups (top) and five groups (bottom); in each case the left branch contains the smaller values of the variable defining the partition.

The percentage employed in agriculture is the variable that determines the partition between Groups 1 and 2 and the rest of the groups. If a five-cluster structure is obtained via k-means, Turkey and Ireland are again each isolated in a single group and Greece, Portugal, Poland, Romania and Yugoslavia form another group, as in the four-cluster structure; Switzerland and East and West Germany now form a new cluster.

Table 3: CUBT clustering structure for five groups.
  Group 1: Turkey, Greece, Poland, Romania, Yugoslavia
  Group 2: Ireland, Portugal, Spain, Bulgaria, Czechoslovakia, Hungary, USSR
  Group 3: Belgium, Denmark, Netherlands, United Kingdom, Norway, Sweden
  Group 4: W. Germany, E. Germany, Switzerland, Austria
  Group 5: France, Italy, Luxembourg

6 Concluding Remarks

A new clustering method, CUBT, in the spirit of CART, is presented, defining the clusters in terms of binary rules on the original variables. This approach shares with classification trees a property which is very important in many practical applications: since the tree structure is based on the original variables, it helps to determine which variables are important in the cluster conformation. Moreover, the tree allows the classification of new observations.

A binary tree is obtained in three stages. In the first stage the sample is split into two subsamples, reducing the heterogeneity of the data within the new subsamples according to the objective function R(·); the procedure is then applied recursively to each subsample. In the second and third stages the maximal tree obtained in the first stage is pruned, using two different criteria, one for adjacent nodes and the other for all the terminal nodes. The algorithm is simple and runs in reasonable computing time, and there are no restrictions on the dimension of the data.

Our method is consistent under quite general assumptions and behaves quite well on the simulated examples that we have considered, as well as on a real data example.

A robust version could be developed by replacing, in the objective function given in (1), Cov(X_t) by a robust covariance functional robcov(X_t) (see for instance Maronna et al., Chapter 6, for a review) and then proceeding in the same way. However, a detailed study of this alternative goes beyond the scope of this work.

7 Appendix

7.1 Proof of Lemma 3.1

Observe first that, since t_l and t_r are disjoint,

$$ E(X_{t_l \cup t_r}) = \gamma \mu_l + (1 - \gamma) \mu_r, $$

where γ = P(X ∈ t_l | X ∈ t_l ∪ t_r). Given j = 1, ..., p, denote by

$$ M_{2i}^{(j)} = E\big(X_{t_i}(j)^2\big) = \frac{1}{\alpha_{t_i}} \int_{t_i} x(j)^2 \, dF(x), \quad i = l, r, $$

the second moments of the j-th coordinates of the restricted vectors, where F stands for the distribution function of the vector X. It follows easily that

$$ E\big(X_{t_l \cup t_r}(j)^2\big) = \gamma M_{2l}^{(j)} + (1 - \gamma) M_{2r}^{(j)}, $$

and therefore

$$ \mathrm{var}(X_{t_l \cup t_r}(j)) = \gamma \, \mathrm{var}(X_{t_l}(j)) + (1 - \gamma) \, \mathrm{var}(X_{t_r}(j)) + \gamma (1 - \gamma) (\mu_l(j) - \mu_r(j))^2. $$

Finally, summing over j we get the desired result.

7.2 Proof of Theorem 3.1

Let T be the family of polygons in R^p with faces orthogonal to the axes, and fix i ∈ {1, ..., p} and t ∈ T. For a ∈ R denote t_l = {x ∈ t : x(i) ≤ a} and t_r = t \ t_l. Define

$$ r(t, i, a) = R(t) - R(t_l) - R(t_r), \qquad (7) $$

and

$$ r_n(t, i, a) = R_n(t) - R_n(t_l) - R_n(t_r), \qquad (8) $$

the corresponding empirical version. We start by showing the uniform convergence

$$ \sup_{a \in \mathbb{R}} \sup_{t \in T} | r_n(t, i, a) - r(t, i, a) | \to 0 \quad a.s. \qquad (9) $$

By Lemma 3.1,

$$ \alpha_t \, r(t, i, a) = \alpha_{t_l} \alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2, \qquad (10) $$

where α_A = P(X ∈ A) and µ_j(a) = E(X_{t_j}), j = l, r. Then the pairs (i_{jn}, a_{jn}) and (i_j, a_j) are the arguments that maximize the right hand side of (10) with respect to the measures P_n and P respectively. Observe that the right hand side of (10) equals

$$ \frac{\alpha_{t_r}}{\alpha_{t_l}} \left\| \int_{t_l} x \, dP(x) \right\|^2 + \frac{\alpha_{t_l}}{\alpha_{t_r}} \left\| \int_{t_r} x \, dP(x) \right\|^2 - 2 \left\langle \int_{t_l} x \, dP(x), \int_{t_r} x \, dP(x) \right\rangle. \qquad (11) $$

In order to prove (9) it suffices to show that:

1. sup_{a∈R} sup_{t∈T} |P_n(t_j) − P(t_j)| → 0 a.s., j = l, r;
2. sup_{a∈R} sup_{t∈T} |∫_{t_j} ‖x‖² dP_n(x) − ∫_{t_j} ‖x‖² dP(x)| → 0 a.s., j = l, r;
3. sup_{a∈R} sup_{t∈T} |∫_{t_j} x(i) dP_n(x) − ∫_{t_j} x(i) dP(x)| → 0 a.s., j = l, r, i = 1, ..., p.

Since T is a Vapnik–Chervonenkis class, condition 1 holds. Now observe that the conditions for uniform convergence over families of sets still hold when dealing with finite signed measures. Therefore, considering the finite measure ‖x‖² dP(x) and the finite signed measure x(i) dP(x), we also have that conditions 2 and 3 hold. Since

$$ \lim_{a \to -\infty} \alpha_{t_l} \alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2 = \lim_{a \to +\infty} \alpha_{t_l} \alpha_{t_r} \| \mu_l(a) - \mu_r(a) \|^2 = 0, $$

we have that

$$ \mathrm{argmax}_{a \in \mathbb{R}} \, r_n(t, i, a) \to \mathrm{argmax}_{a \in \mathbb{R}} \, r(t, i, a) \quad a.s. $$

At the first step of the algorithm t = R^p, and we get that i_{n1} = i_1 for n large enough and a_{n1} → a_1 a.s. For the next step, the empirical procedure will work with t_{nl} and t_{nr}, while the population algorithm will work with t_l and t_r. However, we have that

$$ \sup_{a \in \mathbb{R}} | r_n(t_{nj}, i, a) - r(t_j, i, a) | \le \sup_{a \in \mathbb{R}} | r_n(t_{nj}, i, a) - r(t_{nj}, i, a) | + \sup_{a \in \mathbb{R}} | r(t_{nj}, i, a) - r(t_j, i, a) |, \qquad (12) $$

for j = l, r. We already know that the first term on the right hand side of (12) converges to zero almost surely. To show that the second term also converges to zero, it suffices to show that:

1. sup_{a∈R} |P(t_{nj}) − P(t_j)| → 0 a.s., j = l, r;
2. sup_{a∈R} |∫_{t_{nj}} ‖x‖² dP(x) − ∫_{t_j} ‖x‖² dP(x)| → 0 a.s., j = l, r;
3. sup_{a∈R} |∫_{t_{nj}} x(i) dP(x) − ∫_{t_j} x(i) dP(x)| → 0 a.s., j = l, r, i = 1, ..., p;

which follows from the assumption that ‖x‖² f(x) is bounded. This concludes the proof.

7.3 Proof of Theorem 3.2

We need to show consistency in both steps of the backward algorithm.

(i) Convergence of the pruning step. Let {t_{1n}, ..., t_{mn}} be the output of the forward algorithm. The pruning step partition converges to the corresponding population version from: the conclusions of Theorem 3.1; the fact that the random variables W_lr, d̄_l, d̄_r are positive; the uniform convergence of the empirical quantile function to its population version; and the Lebesgue dominated convergence theorem.

(ii) Convergence of the joining step. Let {t_{1n}, ..., t_{mn}} be the output of the pruning algorithm. Since (i) and the conclusions of Theorem 3.1 hold, the proof will be complete if we show that, for any pair t_{ln}, t_{rn} ∈ {t_{1n}, ..., t_{mn}},

$$ d_{l_n, r_n} = \frac{R_n(t_{ln} \cup t_{rn}) - R_n(t_{ln}) - R_n(t_{rn})}{R_n(t_{ln} \cup t_{rn})} $$

converges almost surely, as n → ∞, to

$$ d_{l, r} = \frac{R(t_l \cup t_r) - R(t_l) - R(t_r)}{R(t_l \cup t_r)}. $$

Now observe that the following inequalities hold:

$$ | R_n(t_{ln}) - R(t_l) | \le | R_n(t_{ln}) - R(t_{ln}) | + | R(t_{ln}) - R(t_l) | \le \sup_{t \in T} | R_n(t) - R(t) | + | R(t_{ln}) - R(t_l) |. \qquad (13) $$

Finally, from an argument similar to that of Theorem 3.1 we obtain that

$$ \lim_{n \to \infty} R_n(t_{sn}) = R(t_s) \quad a.s., \quad s = l, r, \qquad \text{and} \qquad \lim_{n \to \infty} R_n(t_{ln} \cup t_{rn}) = R(t_l \cup t_r) \quad a.s., $$

from which we derive

$$ \lim_{n \to \infty} d_{l_n, r_n} = d_{l, r} \quad a.s., $$

which completes the proof.

References

Andreopoulos, B., An, A. and Wang, X. (2007). Hierarchical Density-Based Clustering of Categorical Data and a Simplification. Advances in Knowledge Discovery and Data Mining.

Basak, J. and Krishnapuram, R. (2005). Interpretable Hierarchical Clustering by Constructing an Unsupervised Decision Tree. IEEE Transactions on Knowledge and Data Engineering.

Blockeel, H., De Raedt, L. and Ramon, J. Top-down induction of clustering trees. Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann.

Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A. (1984). Classification and Regression Trees. Chapman & Hall/CRC.

Chavent, M., Guinot, C., Lechevallier, Y. and Tenenhaus, M. (1999). Méthodes Divisives de Classification et Segmentation Non Supervisée: Recherche d'une Typologie de la Peau Humaine Saine. Revue de Statistique Appliquée, XLVII.

Hastie, T., Tibshirani, R. and Friedman, J. H. (2003). The Elements of Statistical Learning. Springer.

Liu, B., Xia, Y. and Yu, P. S. (2000). Clustering through decision tree construction. CIKM '00: Proceedings of the Ninth International Conference on Information and Knowledge Management, ACM, New York, NY, USA.

Maronna, R. A., Martin, D. R. and Yohai, V. J. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics, Wiley.

Papadimitriou, C. and Steiglitz, K. (1982). Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs: Prentice Hall.

Questier, F., Put, R., Coomans, D., Walczak, B. and Vander Heyden, Y. The use of CART and multivariate regression trees for supervised and unsupervised feature selection. Chemometrics and Intelligent Laboratory Systems.

Smyth, C., Coomans, D., Everingham, Y. and Hancock, T. Autoassociative Multivariate Regression Trees for Cluster Analysis. Chemometrics and Intelligent Laboratory Systems.

Smyth, C., Coomans, D. and Everingham, Y. Clustering noisy data in a reduced dimension space via multivariate regression trees. Pattern Recognition.

Steinley, D. and Brusco, M. (2007). Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques. Journal of Classification.

Tryon, R. C. (1939). Cluster Analysis. McGraw-Hill.


More information

Trends in Human Development Index of European Union

Trends in Human Development Index of European Union Trends in Human Development Index of European Union Department of Statistics, Hacettepe University, Beytepe, Ankara, Turkey spxl@hacettepe.edu.tr, deryacal@hacettepe.edu.tr Abstract: The Human Development

More information

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

More information

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis

Part I. Linear Discriminant Analysis. Discriminant analysis. Discriminant analysis Week 5 Based in part on slides from textbook, slides of Susan Holmes Part I Linear Discriminant Analysis October 29, 2012 1 / 1 2 / 1 Nearest centroid rule Suppose we break down our data matrix as by the

More information

Variance estimation on SILC based indicators

Variance estimation on SILC based indicators Variance estimation on SILC based indicators Emilio Di Meglio Eurostat emilio.di-meglio@ec.europa.eu Guillaume Osier STATEC guillaume.osier@statec.etat.lu 3rd EU-LFS/EU-SILC European User Conference 1

More information

REGRESSION TREE CREDIBILITY MODEL

REGRESSION TREE CREDIBILITY MODEL LIQUN DIAO AND CHENGGUO WENG Department of Statistics and Actuarial Science, University of Waterloo Advances in Predictive Analytics Conference, Waterloo, Ontario Dec 1, 2017 Overview Statistical }{{ Method

More information

arxiv: v1 [cs.ds] 28 Sep 2018

arxiv: v1 [cs.ds] 28 Sep 2018 Minimization of Gini impurity via connections with the k-means problem arxiv:1810.00029v1 [cs.ds] 28 Sep 2018 Eduardo Laber PUC-Rio, Brazil laber@inf.puc-rio.br October 2, 2018 Abstract Lucas Murtinho

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

1 Handling of Continuous Attributes in C4.5. Algorithm

1 Handling of Continuous Attributes in C4.5. Algorithm .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5. and continuous attributes: incorporating continuous

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18

Decision Tree Analysis for Classification Problems. Entscheidungsunterstützungssysteme SS 18 Decision Tree Analysis for Classification Problems Entscheidungsunterstützungssysteme SS 18 Supervised segmentation An intuitive way of thinking about extracting patterns from data in a supervised manner

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

A Markov system analysis application on labour market dynamics: The case of Greece

A Markov system analysis application on labour market dynamics: The case of Greece + A Markov system analysis application on labour market dynamics: The case of Greece Maria Symeonaki Glykeria Stamatopoulou This project has received funding from the European Union s Horizon 2020 research

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Lectures in AstroStatistics: Topics in Machine Learning for Astronomers

Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Lectures in AstroStatistics: Topics in Machine Learning for Astronomers Jessi Cisewski Yale University American Astronomical Society Meeting Wednesday, January 6, 2016 1 Statistical Learning - learning

More information

Logistic Regression and Boosting for Labeled Bags of Instances

Logistic Regression and Boosting for Labeled Bags of Instances Logistic Regression and Boosting for Labeled Bags of Instances Xin Xu and Eibe Frank Department of Computer Science University of Waikato Hamilton, New Zealand {xx5, eibe}@cs.waikato.ac.nz Abstract. In

More information

Risk Bounds for CART Classifiers under a Margin Condition

Risk Bounds for CART Classifiers under a Margin Condition arxiv:0902.3130v5 stat.ml 1 Mar 2012 Risk Bounds for CART Classifiers under a Margin Condition Servane Gey March 2, 2012 Abstract Non asymptotic risk bounds for Classification And Regression Trees (CART)

More information

Machine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler

Machine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler + Machine Learning and Data Mining Decision Trees Prof. Alexander Ihler Decision trees Func-onal form f(x;µ): nested if-then-else statements Discrete features: fully expressive (any func-on) Structure:

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

More information

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs Ammon Washburn University of Arizona September 25, 2015 1 / 28 Introduction We will begin with basic Support Vector Machines (SVMs)

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Resampling Methods CAPT David Ruth, USN

Resampling Methods CAPT David Ruth, USN Resampling Methods CAPT David Ruth, USN Mathematics Department, United States Naval Academy Science of Test Workshop 05 April 2017 Outline Overview of resampling methods Bootstrapping Cross-validation

More information

Gravity Analysis of Regional Economic Interdependence: In case of Japan

Gravity Analysis of Regional Economic Interdependence: In case of Japan Prepared for the 21 st INFORUM World Conference 26-31 August 2013, Listvyanka, Russia Gravity Analysis of Regional Economic Interdependence: In case of Japan Toshiaki Hasegawa Chuo University Tokyo, JAPAN

More information

Hard and Fuzzy c-medoids for Asymmetric Networks

Hard and Fuzzy c-medoids for Asymmetric Networks 16th World Congress of the International Fuzzy Systems Association (IFSA) 9th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT) Hard and Fuzzy c-medoids for Asymmetric Networks

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Kernel Logistic Regression and the Import Vector Machine

Kernel Logistic Regression and the Import Vector Machine Kernel Logistic Regression and the Import Vector Machine Ji Zhu and Trevor Hastie Journal of Computational and Graphical Statistics, 2005 Presented by Mingtao Ding Duke University December 8, 2011 Mingtao

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

Chapter 6: Classification

Chapter 6: Classification Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant

More information

Computing and using the deviance with classification trees

Computing and using the deviance with classification trees Computing and using the deviance with classification trees Gilbert Ritschard Dept of Econometrics, University of Geneva Compstat, Rome, August 2006 Outline 1 Introduction 2 Motivation 3 Deviance for Trees

More information

HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS

HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS HAMILTON CYCLES IN RANDOM REGULAR DIGRAPHS Colin Cooper School of Mathematical Sciences, Polytechnic of North London, London, U.K. and Alan Frieze and Michael Molloy Department of Mathematics, Carnegie-Mellon

More information

Prediction in bioinformatics applications by conformal predictors

Prediction in bioinformatics applications by conformal predictors Prediction in bioinformatics applications by conformal predictors Alex Gammerman (joint work with Ilia Nuoretdinov and Paolo Toccaceli) Computer Learning Research Centre Royal Holloway, University of London

More information

Chapter 14 Combining Models

Chapter 14 Combining Models Chapter 14 Combining Models T-61.62 Special Course II: Pattern Recognition and Machine Learning Spring 27 Laboratory of Computer and Information Science TKK April 3th 27 Outline Independent Mixing Coefficients

More information