More on Unsupervised Learning


Two types of unsupervised problems are finding association rules for items that occur together in observations (market basket analysis) and finding groups of observations that are similar to one another. For finding groups of similar observations, the two main approaches are tree-based methods (divisive or agglomerative) and methods based on distances to prototype points.

Association Rules: Review

A problem related to clustering is market basket analysis, in which we seek to determine which items in a basket (or cluster) are likely to occur together. If A and B are items (or, more generally, subsets of the possible values taken on by a given variable), we consider rules of the form $A \Rightarrow B$. We call A the antecedent and B the consequent. The rule states that if A is present, B is likely to be present.

Association Rules

The support of the rule is the fraction of all baskets in which A and B are both present. We denote this as $T(A \Rightarrow B)$. The confidence of the rule, or predictability of the association, is the conditional fraction
$$C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)}.$$
Finally, the lift of the rule is the conditional fraction
$$L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)} = \frac{T(A \Rightarrow B)}{T(A)\,T(B)}.$$
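
To make the definitions concrete, here is a minimal R sketch (my own, not from the notes) that computes the three quantities for a single rule from a 0/1 basket-by-item matrix; the synthetic matrix and the item names A and B are made up for illustration.

  # synthetic baskets: 100 baskets, 5 items, 0/1 indicators
  baskets <- matrix(rbinom(500, 1, 0.4), ncol = 5,
                    dimnames = list(NULL, c("A", "B", "C", "D", "E")))

  support    <- mean(baskets[, "A"] == 1 & baskets[, "B"] == 1)  # T(A => B)
  confidence <- support / mean(baskets[, "A"] == 1)              # C(A => B)
  lift       <- confidence / mean(baskets[, "B"] == 1)           # L(A => B)
  c(support = support, confidence = confidence, lift = lift)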

K-Means Clustering

The clustering depends on the variability of the variables. It may be necessary to scale the variables in order for the clustering to be sensible, because the larger a variable's variance, the more impact it has on the clustering.

K-means in R: the basic function is kmeans. The first argument is the data and the second is the number of means:

  kmeans(mydata, k)
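
For example, a short sketch of a call with scaling (the data frame mydata and the choice of 3 clusters are placeholders):

  x <- scale(mydata)                          # scale so no variable dominates the distances
  fit <- kmeans(x, centers = 3, nstart = 25)  # nstart: multiple random starts
  fit$cluster                                 # cluster assignment for each observation
  fit$centers                                 # cluster means (on the scaled variables)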

Choosing the Number of Clusters

A major issue is how many clusters should be formed. This question must generally be addressed in an ad hoc manner. A number of statistics have been proposed for deciding how many clusters to use. The Calinski-Harabasz index,
$$\frac{b/(k-1)}{w/(n-k)},$$
where b is the between-groups sum-of-squares,
$$b = \sum_{g=1}^{k}\sum_{j=1}^{m} \left(\bar{x}_{j(g)} - \bar{x}_j\right)^2,$$
and w is the pooled within-groups sum-of-squares, can be used as a stopping rule. The objective is to maximize it.
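
A short R sketch (my own) of using the index to compare several values of k; x is assumed to be a scaled numeric data matrix, and the between- and within-groups sums of squares are taken from the components that kmeans reports (note that kmeans's betweenss weights each group by its size, a common variant of b above).

  ch_index <- function(x, k) {
    fit <- kmeans(x, centers = k, nstart = 25)
    n <- nrow(x)
    b <- fit$betweenss      # between-groups sum of squares (size-weighted)
    w <- fit$tot.withinss   # pooled within-groups sum of squares
    (b / (k - 1)) / (w / (n - k))
  }

  ks <- 2:8
  ch <- sapply(ks, function(k) ch_index(x, k))
  ks[which.max(ch)]         # candidate number of clusters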

Variations on K-Means

K-means, as we have described it, is appropriate for continuous data in which the natural measure of distance is Euclidean. There are many variations on K-means clustering. One, of course, is just to use different measures of similarity. This also affects the representative point of a cluster; it may be something other than the mean. Sometimes the representative point is constrained to be a data point, called a medoid; see Algorithm 14.2 in HTF.

K-Medoids in R

The basic function is kmedoids in the package clue. The first argument is a dissimilarity object for the data and the second is the number of clusters:

  kmedoids(datadist, k)
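
A brief sketch of a call (assuming, as the call above suggests, that kmedoids accepts a dissimilarity object and the number of clusters; check the package documentation):

  library(clue)                     # assumed to provide kmedoids()
  datadist <- dist(scale(mydata))   # pairwise Euclidean dissimilarities
  fit <- kmedoids(datadist, k = 3)  # k-medoids with 3 clusters
  fit

The function pam in the cluster package is another widely used k-medoids implementation.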

Model-Based Hierarchical Clustering

In the general clustering problem, we may assume that the data come from several distributions, and our problem is to identify the distribution from which each observation arose. Without further restrictions, this problem is ill-posed; no solution is any better than any other. We may, however, impose the constraint that the distributions be of a particular type. We may then formulate the problem as one of fitting the observed data to a mixture of distributions of the given type. The problem posed in this way is similar to the problem of density estimation using parametric mixtures. The R package mclust performs model-based clustering.
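
A minimal sketch of its use, assuming the package's main fitting function Mclust and its argument G for the candidate numbers of mixture components (check the package documentation):

  library(mclust)                  # model-based clustering via Gaussian mixtures
  fit <- Mclust(mydata, G = 1:5)   # G: numbers of components to consider
  summary(fit)                     # chosen model and number of clusters
  head(fit$classification)         # hard cluster assignments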

Clusters of Models

We consider data with covariates; that is, in addition to the variable of interest y, there is an associated variable x. (Either may be vector-valued.) We consider models of the form $y = f(x, \theta) + \epsilon$, and we denote the systematic component of the model in the $g$th group as $f_g(x, \theta_g)$. This notation allows retention of the original labels of the dataset. In some early work, this problem was called clusterwise regression. It is also called latent class regression or regression clustering.

Approaches

There are essentially two ways of approaching the problem. They arise from slightly different considerations of why clusters are clusters. These are based on combining the notion of similar relationships among the variables either with the property of closeness or density of the elements, or else with the property of a common probability distribution.

Clustering Based on a Probability Distribution Model

If the number of clusters is fixed to be, say, k, and if the data in each cluster are considered to be a random sample from a given family of probability distributions, we can formulate the clustering problem as a maximum likelihood problem. For a mixture of k distributions, if the PDF of the $j$th distribution is $p_j(x; \theta_j)$, the PDF of the mixture is
$$p(x; \theta) = \sum_{j=1}^{k} \pi_j \, p_j(x; \theta_j),$$
where $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$. The k-vector $\pi$ contains the unconditional probabilities of the components for a random variable from the mixture.

EM Methods

We define k 0-1 dummy variables to indicate the group to which an observation belongs. These dummy variables are not observed, and this leads to the classic EM formulation of the mixture problem. The complete data are $C = (Y, U, x)$, where Y is an observed random variable, U is an unobserved random variable, and x is a (vector-valued) observed covariate.

EM Methods

The E-step yields conditional expectations of the dummy variables. For each observation, the conditional expectation of a given dummy variable can be interpreted as the provisional probability that the observation is from the population represented by that dummy variable. The M-step yields an optimal fit of the model in each group, using the group inclusion probabilities as weights. If it is not practical to use a weighted fit in the M-step, we can instead define a classification likelihood, as in Fraley and Raftery (2002); maximizing the classification likelihood results in each observation being assigned to exactly one group. We could also use the conditional expectations of the dummy variables as probabilities for a random assignment of each observation to a single group.

Issues with an EM Method

The EM method is based on a rather strong model assumption: a likelihood must be formulated. As we said earlier, however, the EM approach can be used even as we change the objective function in the maximization. Two other problems are endemic in EM methods: slow convergence, and convergence to local optima.

Regression Clustering

In regression clustering, we assume a model of the form $y = f_g(x, \theta_g) + \epsilon_g$ for observations y and x in the $g$th group. Usually, of course, we assume linear models of the form $y = x^T \beta_g + \epsilon_g$, with $\epsilon_g \sim N(0, \sigma_g^2)$ and the observations mutually independent. The distribution of the error term allows us to formulate a likelihood, and this provides the quantities necessary for the EM method.
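
As a concrete illustration of the EM iteration for this model, here is a rough R sketch of my own (not from the notes) for a mixture of linear regressions with a single covariate. It uses weighted lm fits for the M-step and normal densities for the E-step; empty or degenerate components are not handled.

  em_regclust <- function(y, x, k = 2, iters = 50) {
    n <- length(y)
    # random initial responsibilities (group inclusion probabilities)
    u <- matrix(runif(n * k), n, k); u <- u / rowSums(u)
    pi_g <- rep(1 / k, k); beta <- vector("list", k); sigma <- rep(sd(y), k)
    for (it in 1:iters) {
      # M-step: weighted least-squares fit and variance estimate in each group
      for (g in 1:k) {
        fit <- lm(y ~ x, weights = u[, g])
        beta[[g]] <- coef(fit)
        sigma[g] <- sqrt(sum(u[, g] * residuals(fit)^2) / sum(u[, g]))
        pi_g[g] <- mean(u[, g])
      }
      # E-step: responsibilities from the normal likelihoods of each group
      dens <- sapply(1:k, function(g)
        pi_g[g] * dnorm(y, beta[[g]][1] + beta[[g]][2] * x, sigma[g]))
      u <- dens / rowSums(dens)
    }
    list(beta = beta, sigma = sigma, pi = pi_g, posterior = u)
  }

  # usage: fit <- em_regclust(y, x, k = 2); apply(fit$posterior, 1, which.max)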

EM Methods

While an EM method is relatively easy to program, the R package flexmix, developed by Friedrich Leisch (2004), provides a simple interface for an EM method for various kinds of regression models. The package allows models of different forms for each group. It uses the classes and methods of R and so is very flexible. The M-step is viewed as a fitting step, and the structure of the package makes it a relatively simple matter to use a different fitting method, such as constrained or penalized regression.
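
A minimal sketch of a flexmix call for a two-component mixture of linear regressions (the data frame df and its columns y and x are placeholders; see the package documentation for the exact interface):

  library(flexmix)
  fit <- flexmix(y ~ x, data = df, k = 2)   # k: number of components
  summary(fit)
  parameters(fit)       # fitted coefficients in each component
  head(clusters(fit))   # hard cluster assignments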

Variable Selection within the Groups

As a practical matter, it is generally convenient to fit a model of the same form and with the same covariates within each group. A slight modification is to select different covariates within each group. Under the usual setup, with models of the form $y = x^T \beta_g + \epsilon_g$, this has no effect on the formulation of the likelihood, but it does introduce the additional step of variable selection in the M-step. One way of approaching this is to substitute a lasso fit for the usual least-squares (i.e., maximum likelihood) fit. The R package lasso2, developed by Lokhorst, Venables, and Turlach (2006), provides a penalty parameter for lasso fitting that drives insignificant coefficients to zero. Alternatively, a lars approach coupled with use of the L-curve could be used to determine a fit.

Other Variations on the M-Step

Rather than viewing the M-step as part of a procedure to maximize a likelihood, we can view it as a step to fit a model using any reasonable criterion. This, of course, changes the basic approach to finding groups in data based on regression models so that it is no longer based on maximum likelihood estimation. Even while using an EM method, the approach is now based on a heuristic notion of good fits of individual models and on clustering the observations according to best fits.

Clustering Based on Closeness

Elements within a cluster are close to each other. If we define a dissimilarity of a given element from some overall summary of a given cluster, the clustering problem is to minimize the dissimilarities within clusters. In some cases it is useful to consider fuzzy cluster membership, but in the following we will assume that each observation is in exactly one cluster. We denote the dissimilarity of the observation $y_i$ from the other members of the $g$th cluster as $d_g(y_i)$, and we define $d_g(y_i) = 0$ if $y_i$ is not in the $g$th cluster.

Clustering Based on Closeness

A given clustering is effectively a partitioning of the dataset. We denote the partition by P, which is a collection of disjoint sets of indices whose union is the set of all indices in the sample, $P = \{P_1, \ldots, P_k\}$. The sum of the discrepancies,
$$f(P) = \sum_{g=1}^{k}\sum_{i=1}^{n} d_g(y_i),$$
is a function of the clustering. For a fixed number of clusters, k, this is the objective function to be minimized with respect to the partition P. This is the basic idea of k-means clustering.

Clustering Based on Closeness

Singleton clusters need special consideration, as does the number of clusters, k. Depending on how we define the discrepancies, and how we measure the discrepancy attributable to a singleton cluster, we could incorporate the choice of k into the objective function.

Clusters of Models

For y in the $g$th group, the discrepancy is a function of the observed y and its predicted or fitted value,
$$d_g(y_i) = h_g(y_i, f_g(x_i, \theta_g)),$$
where $h_g(y_i, \cdot) = 0$ if $y_i$ is not in the $g$th cluster. In many cases, $h_g(y_i, f_g(x_i, \theta_g)) = h_g(y_i - f_g(x_i, \theta_g))$; that is, the discrepancy is a function of the difference between the observed y and its fitted value.

Measures of Dissimilarity

The measure of dissimilarity is a measure of the distance of a given observation from the center of the group of which it is a member. There are two aspects to measures of dissimilarity:

  - the type of center: mean, median, harmonic mean;
  - the type of distance measure.

The distance measure is usually a norm, most often an $L_p$ norm: $L_1$, $L_2$, or $L_\infty$.

K-Means Type Methods

In K-means clustering, the objective function is
$$f(P) = \sum_{g=1}^{k} \sum_{i \in P_g} \| y_i - \bar{y}_g \|^2,$$
where $\bar{y}_g$ is the mean of the observations in the $g$th group. In K-models clustering, $\bar{y}_g$ is replaced by $f_g(x, \hat{\theta}_g)$.

The data shown are similar to some astronomical data from the Sloan Digital Sky Survey (SDSS). The data are two measures of absolute brightness of a large sample of celestial objects. An astronomer had asked for our help in analyzing the data. (The data in the figure are not the original data; that dataset was massive, and would not show well in a simple plot.) The astronomer wanted to fit a regression of one measure on the other.

(Figure: plot of the two brightness measures.)

We could fit some model to the data, of course, but the question is what kind of model. Four possibilities are:

  - a straight line;
  - a curved line (polynomial? exponential?);
  - segmented straight lines;
  - overlapping functions.


Objectives

As in any data analysis, we must identify and focus on the objective. If the objective is prediction of one variable given another, some kind of single model would be desirable. Taking a broader view of the problem, however, we see that there is something more fundamental going on. It is clear that if we are to have any kind of effective regression model, we need another independent variable. We might ask whether there are groups of different types of objects, as suggested by the different models for different subsets of the data. We could perhaps cluster the data based on model fits. Then, if we really want a single regression model, a cluster-identifier variable could allow us to have one.

Clusters

We can take a purely data-driven approach to defining clusters. From this standpoint, clusters are clusters because

  - the elements within a cluster are closer to one another, or they are dense;
  - the elements within a cluster follow a common distribution; or
  - the variables (attributes) in all elements of a cluster have similar relationships among each other.

In an extension of the data-driven approach, we may identify clusters based on some relationship among the variables. The relationship is expressed as a model, perhaps a linear regression model. In this sense, the clusters are conceptual clusters: the clusters are clusters because a common model fits their elements.


Issues in Clustering

Although we may define a clustering problem in terms of a finite mixture distribution, clustering problems are often not built on a probability model. The clustering problem is usually defined in terms of an objective function to minimize, or in terms of the algorithm that solves the problem. In most mixture problems we have an issue of identifiability: the meanings of the group labels cannot be determined from the data, so any solution can be unique only up to permutations of the labels. Another type of identifiability problem arises if the groups are not distinct (or, in practice, sufficiently distinct); this is similar to an over-parameterized model.

Clusters of Models

In regression modeling, we treat one variable as special and treat the other variables as covariates; that is, in addition to the variable of interest y, there is an associated variable x, which is the vector of all other relevant variables. (The variable of interest may also be vector-valued, of course.) The regression models have the general form $y = f(x, \theta) + \epsilon$. To allow the models to be different in different clusters, we denote the systematic component of the model in the $g$th group as $f_g(x, \theta_g)$. This notation allows retention of the original labels of the dataset.

Approaches

There are essentially two ways of approaching the problem. They arise from slightly different considerations of why clusters are clusters. These are based on combining the notion of similar relationships among the variables either with the property of a common probability distribution, or else with the property of closeness or density of the elements. If we assume a common probability distribution for the random component of the models, we can write a likelihood, conditional on knowing the class of each observation. From the standpoint of clusters defined by closeness, we have an objective function that involves norms of residuals.

Clustering Based on a Probability Distribution Model

If the number of clusters is fixed to be k, say, and if the data in each cluster are considered to be a random sample from a given family of probability distributions, we can formulate the clustering problem as a maximum likelihood problem. For a mixture of k distributions, if the PDF of the $j$th distribution is $p_j(x; \theta_j)$, the PDF of the mixture is
$$p(x; \theta) = \sum_{j=1}^{k} \pi_j \, p_j(x; \theta_j),$$
where $\pi_j \ge 0$ and $\sum_{j=1}^{k} \pi_j = 1$. The k-vector $\pi$ contains the unconditional probabilities of the components for a random variable from the mixture.

EM Methods

If we consider each observation to have an additional variable that is not observed, we are led to the classic EM formulation of the mixture problem. We define k 0-1 dummy variables to indicate the group to which an observation belongs. These dummy variables are the missing data in the EM formulation of the mixture problem. The complete data in each observation are $C = (Y, U, x)$, where Y is an observed random variable, U is an unobserved random variable, and x is a (vector-valued) observed covariate.

The E-step yields conditional expectations of the dummy variables. For each observation, the conditional expectation of a given dummy variable can be interpreted as the provisional probability that the observation is from the population represented by that dummy variable. The M-step yields an optimal fit of the model in each group, using the group inclusion probabilities as weights.

Classification Variables

The conditional expectations of the 0-1 dummy variables can be viewed as probabilities that each observation is in the group represented by the dummy variable. There are two possible ways of treating these classification variables. One way is to use the values as weights in fitting the model at each step; this usually results in less variability across the EM steps. The other way, used when a weighted fit is not practical in the M-step, or at the conclusion of the EM computations, is to assign each observation to a single group. If each observation is assigned to the group corresponding to the dummy variable with the largest associated conditional expectation, we can view this as maximizing a classification likelihood (see Fraley and Raftery, 2002).

We could also use the conditional expectations of the dummy variables as probabilities for a random assignment of each observation to a single group if a weighted fit is not practical.

Fuzzy Membership

Interpreting the conditional expected values of the classification variables as probabilities naturally leads to the idea of fuzzy group membership. In the case of only two groups, we may separate the observations into three sets: two sets corresponding to the two groups, and one set that is not classified. This would be based on some threshold value $\alpha > 0.5$. If the conditional expected value of a classification variable is greater than $\alpha$, the observation is put in the cluster corresponding to that variable; otherwise, the observation is not put in either cluster.
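
A small R sketch of this rule (my own illustration), applied to a vector of posterior probabilities for group 1 in a two-group problem; the probabilities and the threshold are made up:

  p1 <- c(0.95, 0.60, 0.10, 0.48, 0.85)   # posterior probability of group 1
  alpha <- 0.8                            # threshold, alpha > 0.5
  membership <- ifelse(p1 > alpha, 1,
                       ifelse(1 - p1 > alpha, 2, NA))  # NA: left unclassified
  membership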


In the case of more than two clusters, the interpretation of the classification variables can be extended to represent likely membership in some given cluster, in two given clusters, or in some combination of any number of clusters. If the likely cluster membership is dispersed among more than two or three clusters, however, it is probably best just to leave that observation unclustered. There are other situations, such as with outliers from all models, in which it may be best to leave an observation unclustered.

Issues with an EM Method

The EM method is based on a rather strong model assumption, so that a likelihood can be formulated. We can take a more heuristic approach, however, and merely view the M-step as model fitting using any reasonable objective function. Instead of maximizing an identified likelihood, we could perform a model fit by minimizing some norm of the residuals, whether or not this corresponds to maximization of a likelihood.

There are other problems that often occur in the use of EM methods. A common one is that the method may be very slow to converge. Another major problem in applications such as mixtures is that there are local optima. This particular problem has nothing to do with EM per se, but rather with any method we may use to solve the problem. Whenever local optima may be present, there are two standard ways of addressing the problem: one is to use multiple starting points, and the other is to allow an iteration to go in a suboptimal direction. The only one of these approaches that would be applicable in model-based clustering is the use of multiple starting points. We did not explore this approach in the present research.

Regression Clustering

In regression clustering, we assume a model of the form $y = f_g(x, \theta_g) + \epsilon_g$ for observations y and x in the $g$th group. Usually we assume linear models of the form $y = x^T \beta_g + \epsilon_g$, with $\epsilon_g \sim N(0, \sigma_g^2)$ and the observations mutually independent. The distribution of the error term allows us to formulate a likelihood, and this provides the quantities necessary for the EM method.

EM Methods

While an EM method is relatively easy to program, the R package flexmix, developed by Leisch (2004), provides a simple interface for an EM method for various kinds of regression models. In our experience with the EM methods as implemented in this package, we rarely had problems with slow convergence in the clustering applications. We also did not find that they were particularly sensitive to the starting values (see Li and Gentle, 2007). The M-step is viewed as a fitting step, and the structure of the package makes it a relatively simple matter to use a different fitting method, such as constrained or penalized regression.

Models with Many Covariates

Models with many covariates are more interesting. In such cases, however, it is likely that different sets of covariates are appropriate for different groups. Use of all covariates would lead to overparameterized models, and hence to fits with larger variance. While this may still result in an effective clustering, it would seriously degrade the performance of any classification scheme based on the fits.

Variable Selection within the Groups

As a practical matter, it is generally convenient to fit a model of the same form and with the same covariates within each group. A slight modification is to select different covariates within each group. Under the usual setup, with models of the form $y = x^T \beta_g + \epsilon_g$, this has no effect on the formulation of the likelihood, but it does introduce the additional step of variable selection in the M-step. Although models with different sets of independent variables can be incorporated in the likelihood, the additional step of variable selection can present problems of computational convergence, as well as major analytic problems. For variable selection in regression clustering, we need a procedure that is automatic.

Penalized Likelihood for Variable Selection within the Groups

A lasso fit for variable selection can be inserted naturally in the M-step of the EM method; that is, instead of the usual least squares fit, which corresponds to maximum likelihood in the case of a known model with normally distributed errors, we minimize
$$\sum_i u_{ig} \left(y_i - x_i^T b_g\right)^2 + \lambda \| b_g \|_1.$$
We could interpret this as maximizing a penalized likelihood. Rather than viewing the M-step as part of a procedure to maximize a likelihood, we can view it as a step to fit a model using any reasonable criterion. This, of course, changes the basic approach to finding groups in data based on regression models so that it is no longer based on maximum likelihood estimation, but the same upper-level computational methods can be used.
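
As one way to carry out this penalized M-step in practice, here is a sketch of my own using the glmnet package rather than lasso2; the objects X, y, u, and lambda are placeholders. The group-inclusion probabilities are passed as observation weights.

  library(glmnet)

  # Weighted lasso fit for group g: minimizes (up to glmnet's internal scaling)
  #   sum_i u[i, g] * (y[i] - x_i' b_g)^2 + lambda * ||b_g||_1
  lasso_mstep <- function(X, y, w, lambda) {
    fit <- glmnet(X, y, family = "gaussian", weights = w,
                  lambda = lambda, standardize = TRUE)
    coef(fit)   # sparse coefficient vector; some entries exactly zero
  }

  # usage (placeholders): b_g <- lasso_mstep(X, y, u[, g], lambda = 0.1)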

Alternatively, a lars approach coupled with use of the L-curve could be used to determine a fit; either way, the lasso fit often yields models with some fitted coefficients exactly 0. The use of the lasso of course biases the estimators of the selected variables downward. The overall statistical properties of the variable selection procedure are not fully understood; lasso fitting seems useful within the EM iterations, however. At the end, the variables selected within the individual groups can be fitted by regular, that is, nonpenalized, least squares. Even while using an EM method, the approach is now based on a heuristic notion of good fits of individual models and on clustering the observations according to best fits.

Clustering Based on Closeness

The idea of forming clusters based on model fits leads us to the general idea of clustering based on closeness to a model center. Elements within a cluster are close to each other. If we define a dissimilarity of a given element from some overall summary of a given cluster, the clustering problem is to minimize the dissimilarities within clusters. In some cases it is useful to consider fuzzy cluster membership, but in the following we will assume that each observation is in exactly one cluster. We denote the dissimilarity of the observation $y_i$ from the other members of the $g$th cluster as $d_g(y_i)$, and we define $d_g(y_i) = 0$ if $y_i$ is not in the $g$th cluster.

A given clustering is effectively a partitioning of the dataset. We denote the partition by P, which is a collection of disjoint sets of indices whose union is the set of all indices in the sample, $P = \{P_1, \ldots, P_k\}$. The sum of the discrepancies,
$$f(P) = \sum_{g=1}^{k}\sum_{i=1}^{n} d_g(y_i),$$
is a function of the clustering. For a fixed number of clusters, k, this is the objective function to be minimized with respect to the partition P. This of course is the basic idea of k-means clustering.

In any kind of clustering method, singleton clusters need special consideration. Such clusters may be more properly considered as outliers, and their numbers do not contribute to the total count of the number of clusters, k. The number of clusters is itself an important characteristic of the problem. In some cases our knowledge of the application may lead to a known number of clusters, or at least it may lead to an appropriate choice of k. Depending on how we define the discrepancies, and how we measure the discrepancy attributable to a singleton cluster, we could incorporate the choice of k into the objective function.

Clusters of Models

For y in the $g$th group, the discrepancy is a function of the observed y and its predicted or fitted value,
$$d_g(y_i) = h_g(y_i, f_g(x_i, \theta_g)),$$
where $h_g(y_i, \cdot) = 0$ if $y_i$ is not in the $g$th cluster. In many cases, $h_g(y_i, f_g(x_i, \theta_g)) = h_g(y_i - f_g(x_i, \theta_g))$; that is, the discrepancy is a function of the difference between the observed y and its fitted value.

Measures of Dissimilarity

The measure of dissimilarity is a measure of the distance of a given observation to the center of the group of which it is a member. There are two aspects to measures of dissimilarity:

  - the type of center (mean, median, harmonic mean); this is the $f_g(x_i, \theta_g)$ above;
  - the type of distance measure; this is the $h_g(y_i - f_g(x_i, \theta_g))$ above.

The type of center, for example whether it is based on a least squares criterion such as a mean or on a least absolute values criterion such as a median, affects the robustness of the clustering procedure. Zhang and Hsu (1999) showed that if harmonic means are used instead of means in k-means clustering, the clusters are less sensitive to the starting values. Zhang (2003) used a harmonic average for the regression clustering problem; that is, instead of using the within-groups residual norms, he used a harmonic mean of the within-groups residuals. The insensitivity of a harmonic average to outlying values may cause problems when the groups are not tightly clustered within the model predictions. Nevertheless, this approach seems promising, but more studies under different configurations are needed.

The type of distance measure is usually a norm of the coordinate differences of a given observation and the center. Most often this is an $L_p$ norm: $L_1$, $L_2$, or $L_\infty$. It may seem natural that the distance of an observation to the center be based on the same type of measure as the measure used to define the center, but this is not necessary.

K-Means Type Methods

In k-means clustering, the objective function is
$$f(P) = \sum_{g=1}^{k} \sum_{i \in P_g} \| y_i - \bar{y}_g \|^2,$$
where $\bar{y}_g$ is the mean of the observations in the $g$th group. In K-models clustering, $\bar{y}_g$ is replaced by $f_g(x, \hat{\theta}_g)$. When the model predictions are used as the centers, the method is the same as the substitution method used in a univariate k-means clustering algorithm.
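
A rough sketch (my own, not the algorithm referred to below) of a simple alternating scheme for K-models clustering with straight-line models: assign each observation to the model with the smallest squared residual, then refit each model on its assigned observations.

  kmodels <- function(y, x, k = 2, iters = 20) {
    n <- length(y)
    grp <- sample(1:k, n, replace = TRUE)   # random initial assignment
    for (it in 1:iters) {
      # fit a straight-line model within each current group
      # (no handling of empty groups; this is only a sketch)
      coefs <- lapply(1:k, function(g) coef(lm(y[grp == g] ~ x[grp == g])))
      # squared residual of every observation under every group's fitted line
      res2 <- sapply(1:k, function(g) (y - (coefs[[g]][1] + coefs[[g]][2] * x))^2)
      grp <- apply(res2, 1, which.min)      # reassign each point to its best model
    }
    list(group = grp, coef = coefs)
  }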

K-means clustering is a combinatorial problem, and the methods are computationally complex. The most efficient methods currently are based on simulated annealing with substitution rules. These methods can allow the iterations to escape from local optima. Because of the local optima, however, any algorithm for k-means clustering is likely to be sensitive to the starting values. As in any combinatorial optimization problem, the performance depends on the method of choosing a new trial point and on the cooling schedule. We are currently investigating these steps in a simulated annealing method for regression clustering, but we do not have any useful results yet.

K-Models Clustering Following Clustering of the Covariates

When the covariates have clusters among themselves, a simple clustering method applied only to them may yield good starting values for either an EM method or a k-means method for the regression clustering problem. There may be other types of prior information about the group membership of the individual observations. Any such information, either from group assignments based on clustering of the covariates or from prior assumptions, can be used in the computation of the expected values of the classification variables. Clearly, clustering of the covariates has limited effectiveness. We will be trying to characterize distributional patterns in order to tell when preliminary clustering of the covariates is useful in the regression clustering problem.
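
A short sketch of this initialization idea (my own; whether flexmix accepts initial group assignments through a cluster argument is an assumption to check against its documentation):

  # cluster the covariate alone to get starting group assignments
  start <- kmeans(scale(df$x), centers = 2, nstart = 25)$cluster

  # use the assignments as starting values for the mixture-of-regressions EM
  # (the 'cluster' argument is assumed here; consult help(flexmix))
  library(flexmix)
  fit <- flexmix(y ~ x, data = df, k = 2, cluster = start)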


More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information