arxiv: v1 [stat.me] 7 Aug 2015


Dimension reduction for model-based clustering

Luca Scrucca
Università degli Studi di Perugia
August 2015
arXiv preprint [stat.ME], v1, 7 Aug 2015

Abstract

We introduce a dimension reduction method for visualizing the clustering structure obtained from a finite mixture of Gaussian densities. Information on the dimension reduction subspace is obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. The proposed method aims at reducing the dimensionality by identifying a set of linear combinations of the original features, ordered by importance as quantified by the associated eigenvalues, which capture most of the cluster structure contained in the data. Observations may then be projected onto such a reduced subspace, thus providing summary plots which help to visualize the clustering structure. These plots can be particularly appealing in the case of high-dimensional data and noisy structure. The newly constructed variables capture most of the clustering information available in the data, and they can be further reduced to improve clustering performance. We illustrate the approach on both simulated and real data sets.

Keywords: Dimension reduction, Model-based clustering, Gaussian mixture, Sliced inverse regression, Feature selection.

1 Introduction

Model-based clustering assumes that the observed data are generated from a mixture of $G$ components, each representing the probability distribution for a different group or cluster (McLachlan and Peel, 2000). The general form of a finite mixture model is

$$f(x) = \sum_{g=1}^{G} \pi_g f_g(x \mid \theta_g),$$

where the $\pi_g$ are the mixing probabilities, with $\pi_g \geq 0$ and $\sum_{g=1}^{G} \pi_g = 1$, and $f_g(\cdot)$ and $\theta_g$ are, respectively, the density and the parameters of the $g$-th component ($g = 1, \ldots, G$). With continuous data, we often take the density for each mixture component to be the multivariate Gaussian $\phi(x \mid \mu_g, \Sigma_g)$ with parameters $\theta_g = (\mu_g, \Sigma_g)$.
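As a small illustration of the mixture density just defined, the following Python sketch evaluates a Gaussian mixture at a point. The function names `gaussian_pdf` and `mixture_pdf` are hypothetical and chosen only for this example; the code assumes NumPy is available and is not taken from the MCLUST software.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density phi(x | mean, cov) evaluated at one point."""
    x = np.atleast_1d(x).astype(float)
    mean = np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    p = mean.size
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)          # (x - mu)' Sigma^{-1} (x - mu)
    norm = np.sqrt((2.0 * np.pi) ** p * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def mixture_pdf(x, weights, means, covs):
    """Finite mixture density f(x) = sum_g pi_g * phi(x | mu_g, Sigma_g)."""
    return sum(w * gaussian_pdf(x, m, S) for w, m, S in zip(weights, means, covs))

# Illustrative two-component mixture in two dimensions (values are made up).
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 1.0]]
covs = [np.eye(2), [[2.0, 0.5], [0.5, 1.0]]]
density_at_origin = mixture_pdf([0.0, 0.0], weights, means, covs)
```

The mixing weights are assumed to be nonnegative and to sum to one; the sketch does not check this.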
Thus, clusters are ellipsoidal, centered at the means $\mu_g$, and with other geometric features, such as volume, shape and orientation, determined by $\Sigma_g$. Parsimonious parametrization of covariance matrices can be adopted through the eigenvalue decomposition $\Sigma_g = \lambda_g D_g A_g D_g^\top$, where $\lambda_g$ is a scalar controlling the volume of the ellipsoid, $A_g$ is a diagonal matrix specifying the shape of the density contours, and $D_g$ is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery, 1993; Celeux and Govaert, 1995). Fraley and Raftery (2006) tabulate some parametrizations of within-group covariance matrices available in the MCLUST software, together with the corresponding geometric characteristics. Maximum likelihood estimates for this type of mixture model can be computed via the EM algorithm (Dempster et al, 1977; Fraley and Raftery, 2002), while model selection can be based on the Bayesian information criterion (BIC) (Fraley and Raftery, 1998) or the integrated complete-data likelihood (ICL) criterion (Biernacki et al, 2000).

In this paper we propose a dimension reduction method for model-based clustering, where we pursue the estimation of a reduced subspace which captures most of the clustering structure contained in the data. Following the work of Li (1991, 2000) on Sliced Inverse Regression (SIR), information on the dimension reduction subspace is obtained from the variation on group means

and, depending on the estimated mixture model, on the variation on group covariances. The estimated directions, ordered by importance as quantified by the associated eigenvalues, span a dimension reduction subspace. Observations may then be plotted on such a subspace, hence providing summary plots which visualize the clustering structure, and may help to understand the clustering and improve accuracy.

In Section 2 we introduce the problem of clustering on a dimension reduced subspace, and we show that the assignment of an observation to a given cluster is unchanged if performed on a suitable subspace. Then we discuss estimation of the directions which span the reduced subspace, and we provide some of their properties. In Section 3 we review and adapt a greedy search procedure for selecting a subset of the newly constructed variables which retains most of the clustering information contained in the data. In Section 4 we describe a data analysis example based on a synthetic data set; then we discuss results from simulation studies where we compare the proposed approach to model-based clustering performed on both the original variables and the leading principal components. Section 5 presents applications of the proposed procedure to some real data sets. The final section contains some concluding remarks.

2 Dimension reduction for model-based clustering

Classical procedures for dimensionality reduction are principal components analysis and factor analysis, both of which reduce dimensionality by forming linear combinations of the features. The first method seeks a lower-dimensional representation that accounts for the most variance of the features, while the second looks for the most correlation among the features. However, neither method directly addresses the problem of visualizing any potential clustering structure.
Chang (1983) discussed a simulated example from a 15-dimensional mixture model to show the failure of principal components as a method for reducing the dimension of the data before clustering. Let $X = 0.5\,d + dY + Z$, where d = i ($i = 1, \ldots, 15$), Y ∼ Bernoulli(0.), $Z \sim N(\mu, \Sigma)$ with mean $\mu_{15} = [0, \ldots, 0]^\top$ and covariance matrix $\Sigma_{15 \times 15} = [\sigma_{ij}]$, σ_ii = , σ_ij = 0.f_i f_j, where the first 8 elements of $f$ are 0.9 and the last 7 are 0.5. With this scheme the first 8 variables can be considered roughly as a block of variables with the same correlations, while the rest of the variables form another block. Using this scheme we simulated n = 00 data points. Figure 1 shows the scatterplot matrix of the first, second and last principal components, and the first GMMDR variable (to be discussed in Section 2.2) obtained from a two-component EEE mixture model. As can be seen, the first two PCs are not able to show any clustering information; as discussed by Chang (1983), the first and the last PCs are needed. In contrast, only one GMMDR variable is required to clearly separate the clusters. Furthermore, the coefficients (under unit norm constraint) of the linear combination defining such a direction reproduce the blocking structure used to simulate the variables, with the first 8 variables having coefficients approximately equal to 0. and the remaining 7 coefficients approximately equal to 0.6.

Tipping and Bishop (1999b) proposed a probabilistic principal component analysis based on a factor analysis model for the data, assuming that the distribution of the errors is isotropic, i.e. the covariance matrix is proportional to the identity matrix. They also developed a mixture of probabilistic principal components analyzers model (Tipping and Bishop, 1999a), and a hierarchical visualization algorithm for exploring clusters (Bishop and Tipping, 1998).
McLachlan and Peel (2000) and McLachlan et al (2003) used mixtures of factor analyzers, by postulating that the distribution of the data within any group or latent class follows a factor analysis model. Parsimonious Gaussian mixture models based on mixtures of factor analyzers were discussed by McNicholas and Murphy (2008). In order to deal with high dimensional data, Bouveyron et al (2007) proposed a family of parsimonious GMMs on clustering subspaces. All such models mainly address the problem of reducing the number of parameters to be estimated by a suitable reparametrization of the covariance matrix. Since they produce estimates of (latent) factor analyzers, they can also be used to reduce dimensionality. However, as noted by McLachlan and Peel (2000, p. 48), the resulting plots are not always useful in showing clustering structure. To support this claim, they presented an example using a simple two-cluster synthetic data set where the differences between the group means are only in one variable, and with within-group variation in the other variables being relatively large. They showed that both principal components and factor analyzers were unable to show the underlying clustering structure (see McLachlan and Peel, 2000, p. 40, 54 55). We applied the GMMDR method proposed in this paper, and the data projected onto the subspace spanned by the first two directions are shown in Figure 2: the two clusters appear well separated, with one group showing less variation than the other.

Figure 1: Scatterplot matrix of the 1st, 2nd and 15th PC, and the 1st GMMDR variable for the Chang data, with points marked according to cluster membership.

2.1 Clustering on a dimension reduced subspace

Consider the $(p \times 1)$ vector $X$ of random variables, and the discrete random variable $Y$ taking $G$ distinct values to indicate the $G$ clusters. Let $\beta$ denote a fixed $(p \times d)$ matrix, where $d \leq p$, such that $Y \perp\!\!\!\perp X \mid \beta^\top X$. This conditional independence statement says that the distribution of $Y \mid X$ is the same as that of $Y \mid \beta^\top X$ for all values of $X$ in its marginal sample space. As a consequence, the $(p \times 1)$ vector $X$ can be replaced by the $(d \times 1)$ vector $\beta^\top X$ without loss of clustering information. If $d < p$, we have effectively reduced the dimension of the predictor vector. The matrix $\beta$ provides a basis for the subspace $S(\beta)$, which in a regression context is called a dimension-reduction subspace (DRS) for the regression of $Y$ on $X$ (Li, 1991). It can be shown that a minimum DRS may not be unique, but when several such subspaces exist they all have the same dimension. Let $S_{Y|X}$ denote the intersection of all DRSs.
Under various reasonable conditions given in Cook (1998, Chap. 6), $S_{Y|X}$ is a unique minimum DRS and it is called the central DRS.

Figure 2: Plot of McLachlan and Peel (2000, p. 9 40) synthetic data projected along the first two GMMDR directions (axes: Dir1, Dir2), with points marked according to true group origin. The curve represents the clustering boundary.

The assumption $Y \perp\!\!\!\perp X \mid \beta^\top X$ implies that $\Pr(Y = g \mid X) = \Pr(Y = g \mid \beta^\top X)$, so we may write the density for the $g$-th component of the mixture model as

$$f_g(x) = \frac{\Pr(Y = g \mid x)\, f_X(x)}{\Pr(Y = g)} = \frac{\Pr(Y = g \mid \beta^\top x)\, f_X(x)}{\Pr(Y = g)} = f_g(\beta^\top x)\, \frac{f_X(x)}{f_{\beta^\top X}(\beta^\top x)},$$

and for any two groups, $g$ and $k$, the ratio of the densities is

$$\frac{f_g(x)}{f_k(x)} = \frac{f_g(\beta^\top x)}{f_k(\beta^\top x)}.$$

Thus, if $Y \perp\!\!\!\perp X \mid \beta^\top X$, the ratio of the conditional densities is the same whether it is computed on the original variables space or on the DRS, so the clustering information is completely contained in $S(\beta)$.

For instance, consider the simple case of two clusters with bivariate Gaussian distribution $x \mid (Y = j) \sim N(\mu_j, I_2)$ for $j = 1, 2$, where $\mu_1 = [-1, 0]^\top$, $\mu_2 = [1, 0]^\top$, and $I_2$ is the $(2 \times 2)$ identity matrix. The ratio of the densities is thus given by

$$\frac{f_1(x)}{f_2(x)} = \exp\left\{ -\tfrac{1}{2}(x - \mu_1)^\top (x - \mu_1) + \tfrac{1}{2}(x - \mu_2)^\top (x - \mu_2) \right\} = \exp\{-2x_1\}.$$

It is straightforward to see that the direction containing all the clustering information is $\beta = [1, 0]^\top$, so the ratio of the densities projected along such direction is

$$\frac{f_1(\beta^\top x)}{f_2(\beta^\top x)} = \exp\left\{ -\tfrac{1}{2}(\beta^\top x - \beta^\top \mu_1)^2 + \tfrac{1}{2}(\beta^\top x - \beta^\top \mu_2)^2 \right\} = \exp\left\{ -\tfrac{1}{2}(x_1 + 1)^2 + \tfrac{1}{2}(x_1 - 1)^2 \right\} = \exp\{-2x_1\}. \qquad (1)$$
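The two-cluster example can be checked numerically. The sketch below assumes cluster means (-1, 0) and (1, 0) with identity covariance, values chosen to be consistent with the exponent in the example, and compares the log density ratio computed on the original two variables with the one computed after projecting onto β = (1, 0). The function names are hypothetical.

```python
import numpy as np

# Assumed parameters for the illustration: two unit-covariance Gaussian clusters.
mu1 = np.array([-1.0, 0.0])
mu2 = np.array([1.0, 0.0])
beta = np.array([1.0, 0.0])      # direction carrying the clustering information

def log_ratio_full(x):
    """log f_1(x) / f_2(x) computed on the original bivariate space."""
    return -0.5 * np.sum((x - mu1) ** 2) + 0.5 * np.sum((x - mu2) ** 2)

def log_ratio_projected(x):
    """log f_1(b'x) / f_2(b'x) computed on the one-dimensional projection."""
    z = beta @ x
    return -0.5 * (z - beta @ mu1) ** 2 + 0.5 * (z - beta @ mu2) ** 2

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 2))
# Largest discrepancy between the two ratios over a few random points.
max_gap = max(abs(log_ratio_full(x) - log_ratio_projected(x)) for x in points)
```

Both functions reduce to -2 x_1, so `max_gap` is zero up to floating point error, whatever the second coordinate of each point.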

In model-based clustering an observation is assigned to cluster $g$ by the maximum a posteriori principle (MAP), i.e. to the cluster for which the conditional probability given the data is a maximum:

$$\arg\max_g \Pr(Y = g \mid x) = \arg\max_g \frac{\pi_g f_g(x)}{\sum_{j=1}^{G} \pi_j f_j(x)}.$$

By equation (1), the last expression is equivalent to

$$\arg\max_g \Pr(Y = g \mid \beta^\top x) = \arg\max_g \frac{\pi_g f_g(\beta^\top x)}{\sum_{j=1}^{G} \pi_j f_j(\beta^\top x)},$$

so the assignment of an observation to a cluster is unchanged if performed on the DRS.

Cook and Yin (2001, p. 56) considered $D(Y \mid X) = \arg\max_g \Pr(Y = g \mid x)$ and defined a discriminant subspace as a subspace $S$ such that $D(Y \mid X) = D(Y \mid P_S X)$ for all values of $X$, where $P_S$ denotes the projection operator onto $S$ with respect to the usual inner product. The intersection of all discriminant subspaces, if it exists, is itself a discriminant subspace, and is called the central discriminant subspace (CDS) and denoted by $S_{D(Y|X)}$. Finally, the CDS is a subset of the central DRS, $S_{D(Y|X)} \subseteq S_{Y|X}$, and in some cases may be a proper subset.

In the case of known group memberships, a variety of methods for dimensionality reduction have been proposed (Cook and Yin, 2001). These methods, such as Sliced Inverse Regression (SIR) and Sliced Average Variance Estimation (SAVE), are not restricted to any particular classification method. In the following subsection we discuss estimation of a dimension reduction subspace for a model-based clustering procedure.

2.2 Estimation of GMMDR directions

Suppose we describe a set of $n$ observations on $p$ variables through a $G$-component Gaussian mixture model (GMM) of the form $f(x) = \sum_{g=1}^{G} \pi_g \phi(x \mid \mu_g, \Sigma_g)$. From a graphical point of view, we pursue the smallest subspace that can capture the clustering information contained in the data. A natural starting point is to look for those directions where the cluster means $\mu_g$ vary as much as possible, provided each direction is orthogonal to the others.
This amounts to solving the following optimization problem:

$$\arg\max_{\beta}\; \beta^\top \Sigma_B \beta, \quad \text{subject to } \beta^\top \Sigma \beta = I_d,$$

where $\Sigma_B = \sum_{g=1}^{G} \pi_g (\mu_g - \mu)(\mu_g - \mu)^\top$ is the between-cluster covariance matrix, $\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top$ is the covariance matrix with $\mu = \sum_{g=1}^{G} \pi_g \mu_g$, $\beta \in \mathbb{R}^{p \times d}$ is the spanning matrix, and $I_d$ is the $(d \times d)$ identity matrix. The solution to this constrained optimization is given by the eigendecomposition of the kernel matrix $M_I \equiv \Sigma_B$ with respect to $\Sigma$. The eigenvectors corresponding to the first $d$ largest eigenvalues, $[v_1, \ldots, v_d] \equiv \beta$, provide a basis for the subspace $S(\beta)$ which shows the maximal variation among cluster means. There are at most $d = \min(p, G - 1)$ directions which span this subspace.

This procedure is similar to the sliced inverse regression (SIR) algorithm (Li, 1991). SIR is a dimension reduction method which exploits the information from the inverse mean function (see Li, 2000). Instead of conditioning on the response variable (or a sliced version of it if continuous), here we condition on the cluster memberships. SIR directions span at least part of the dimension reduction subspace (Cook, 1998, Chap. 6) and they provide an intuitive and useful basis for constructing summary plots. However, they may miss relevant structures in the data when within-cluster covariances are different.

The SIR II method (Li, 2000, Chap. 5) exploits the information coming from the differences in the class covariance matrices. The kernel matrix is now defined as

$$M_{II} = \sum_{g=1}^{G} \pi_g (\Sigma_g - \bar{\Sigma})\, \Sigma^{-1} (\Sigma_g - \bar{\Sigma})^\top,$$

where $\bar{\Sigma} = \sum_{g=1}^{G} \pi_g \Sigma_g$ is the pooled within-cluster covariance matrix, and directions are found through the eigendecomposition of $M_{II}$ with respect to $\Sigma$. Although the corresponding directions show differences in group covariances, they are usually not able to also show location differences.

The proposed approach aims at finding those directions which, depending on the selected Gaussian mixture model, are able to display both variation in cluster means and variation in cluster covariances.

Definition 1. Consider the following kernel matrix

$$M = M_I \Sigma^{-1} M_I + M_{II}. \qquad (2)$$

The basis of the dimension reduction subspace $S(\beta)$ is the solution of the following constrained optimization:

$$\arg\max_{\beta}\; \beta^\top M \beta, \quad \text{subject to } \beta^\top \Sigma \beta = I_d.$$

This is solved through the generalized eigendecomposition

$$M v_i = l_i \Sigma v_i, \qquad v_i^\top \Sigma v_j = 1 \text{ if } i = j, \text{ and } 0 \text{ otherwise}, \qquad l_1 \geq l_2 \geq \cdots \geq l_d > 0. \qquad (3)$$

Thus, the basis of the required subspace is given by the GMMDR directions $\beta \equiv [v_1, \ldots, v_d]$.

The kernel matrix (2) contains information from variation on both cluster means and cluster covariances. For mixture models which assume constant within-cluster covariance matrices, such as models E, EII, EEI, and EEE (see Fraley and Raftery, 2006), the subspace spanned by $M$ is equivalent to that spanned by $M_I$.

Remark 1. Let $S(\beta)$ be the subspace spanned by the $(p \times d)$ matrix $\beta$ of eigenvectors from (3), and $\mu_g$ and $\Sigma_g$ be the mean and covariance, respectively, for the $g$-th cluster. It can be easily seen that the projections of the parameters onto the subspace $S(\beta)$ are given by $\beta^\top \mu_g$ and $\beta^\top \Sigma_g \beta$. Furthermore, the projection of the $(n \times p)$ data matrix $X$ onto $S(\beta)$ can be computed as $Z = X\beta$; we call these GMMDR variables.

Remark 2. The raw coefficients of the estimated directions are uniquely determined up to multiplication by a scalar, whereas the associated directions from the origin are unique. Hence, we can adjust their length such that they have unit length, i.e. each direction is normalized as $\beta_j \equiv v_j / \|v_j\|$ for $j = 1, \ldots, d$.
In matrix form, let $D = \mathrm{diag}(V^\top V)$, where $V$ is the matrix of eigenvectors from (3); then $\beta \equiv V D^{-1/2}$. For the GMMDR variables, $\mathrm{Cov}(Z) = \beta^\top \Sigma \beta = D^{-1/2} V^\top \Sigma V D^{-1/2} = D^{-1} = \mathrm{diag}(1 / \|v_j\|^2)$. Thus, they are uncorrelated, with variances inversely related to the squared length of the eigenvectors of the kernel matrix, while GMMDR directions are orthogonal with respect to the $\Sigma$-inner product.

We now provide some properties as propositions whose proofs are reported in Appendix A.

Proposition 1. GMMDR directions are invariant under the affine transformation $x \mapsto Cx + a$, for $C$ a nonsingular matrix and $a$ a vector of real values. Thus, GMMDR directions in the transformed scale are given by $C^{-\top} \beta$.

Proposition 2. Each eigenvalue of the eigendecomposition in (3) can be decomposed into the sum of the contributions given by the squared variance of the between-group means and the average of the squared within-group variances along the corresponding direction of the projection subspace, i.e.

$$l_i = \mathrm{Var}(E(Z_i \mid Y))^2 + E(\mathrm{Var}(Z_i \mid Y)^2) \quad \text{for } i = 1, \ldots, d.$$
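The construction of the kernel matrix and the generalized eigendecomposition described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the author's implementation: the function name `gmmdr_directions` is hypothetical, NumPy is assumed, and the marginal covariance Σ is taken to be the between-cluster covariance plus the pooled within-cluster covariance, which is an assumption made here so that the example needs only the mixture parameters.

```python
import numpy as np

def gmmdr_directions(pi, mus, Sigmas):
    """Eigenvalues and Sigma-orthonormal eigenvectors of M v = l Sigma v."""
    pi = np.asarray(pi, dtype=float)            # (G,) mixing probabilities
    mus = np.asarray(mus, dtype=float)          # (G, p) cluster means
    Sigmas = np.asarray(Sigmas, dtype=float)    # (G, p, p) cluster covariances
    mu = pi @ mus                               # overall mean
    dev = mus - mu
    Sigma_B = (pi[:, None] * dev).T @ dev       # between-cluster covariance
    Sigma_bar = np.einsum("g,gij->ij", pi, Sigmas)   # pooled within-cluster covariance
    Sigma = Sigma_B + Sigma_bar                 # ASSUMPTION: marginal covariance
    Sigma_inv = np.linalg.inv(Sigma)
    M_I = Sigma_B
    M_II = sum(w * (S - Sigma_bar) @ Sigma_inv @ (S - Sigma_bar).T
               for w, S in zip(pi, Sigmas))
    M = M_I @ Sigma_inv @ M_I + M_II            # kernel matrix
    # Generalized symmetric eigenproblem solved through the Cholesky factor of Sigma.
    L = np.linalg.cholesky(Sigma)
    L_inv = np.linalg.inv(L)
    C = L_inv @ M @ L_inv.T
    evals, U = np.linalg.eigh((C + C.T) / 2.0)
    order = np.argsort(evals)[::-1]             # largest eigenvalue first
    V = L_inv.T @ U[:, order]                   # columns satisfy v_i' Sigma v_j = delta_ij
    return evals[order], V, Sigma, M

# Three clusters in four dimensions with a common covariance (made-up values).
pi = [0.3, 0.3, 0.4]
mus = [[0, 0, 0, 0], [2, 0, 0, 0], [0, 3, 0, 0]]
Sigmas = [np.eye(4)] * 3
evals, V, Sigma, M = gmmdr_directions(pi, mus, Sigmas)
beta = V / np.linalg.norm(V, axis=0)            # unit-length directions
```

In this common-covariance example only G - 1 = 2 eigenvalues are nonzero and the corresponding directions load only on the first two variables, in line with the statement that the subspace spanned by M then coincides with that spanned by M_I.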

Based on this result we may provide an interpretation of the contribution of each direction to the visualization of the clustering structure. Furthermore, directions corresponding to small eigenvalues provide little or no information about differences in cluster means or covariances.

Remark 3. Let $X$ be the $(n \times p)$ sample data matrix which we assume, with no loss of generality, to have zero mean column-vectors. The sample version $\hat{M}$ of (2) is obtained using the corresponding estimates from the fit of a Gaussian mixture model via the EM algorithm. Then, GMMDR directions are calculated from the generalized eigendecomposition of $\hat{M}$ with respect to $\hat{\Sigma}$.

Remark 4. The dimension of the estimated subspace $S(\hat{\beta})$ is $d = \min\{p, G - 1\}$ for models which assume equal within-cluster covariance matrices (i.e. models E, EII, EEI, and EEE), while $d \leq p$ otherwise. The GMMDR directions are ordered based on the associated eigenvalues, so those directions corresponding to approximately zero eigenvalues can be discarded in practice, since clusters largely overlap along such directions. A more formal approach to the selection of relevant directions is discussed in Section 3.

3 Subset selection of GMMDR variables

In the previous section we have discussed the estimation of GMMDR variables, which is basically a linear mapping onto a suitable subspace. This can be viewed as a form of (soft) feature extraction, where the components are reduced through a set of linear combinations of the original variables. However, it is possible that a subset of the estimated GMMDR variables provides either no clustering information or redundant clustering information already provided by other variables. Consider the partitioned $(p \times p)$ matrix $(\beta, \beta_0)$, which forms a basis for $\mathbb{R}^p$. Based on a proposition of Cook and Yin (2001, p.57), if $\beta_0^\top X \perp\!\!\!\perp \beta^\top X \mid Y$ and $\beta_0^\top X \perp\!\!\!\perp Y$ then $S(\beta)$ is a DRS.
This is useful since, although it does not indicate how to construct $\beta$, it indicates that unnecessary variables $\beta_0^\top X$ are independent from relevant variables $\beta^\top X$ within groups, and that they are also marginally independent from $Y$. In our case, given that unnecessary GMMDR variables provide no clustering information but require parameter estimation, we would like to detect and discard them. In the following we consider a subset selection method for GMMDR variables following the approach of Raftery and Dean (2006), who proposed the use of BIC to approximate Bayes factors for comparing mixture models fitted on nested subsets of variables. A generalization of this approach has been recently discussed by Maugis et al (2009).

3.1 A criterion for feature selection

Let $S_1$ be the set of indices with $\dim(S_1) = q$ indexing a subset of $q$ features from the original $d$ GMMDR variables $Z$ ($q \leq d$), and $S_2 = \{S_1 \setminus i\} \subset S_1$ be the set of $\dim(S_2) = q - 1$ indices obtained by excluding the $i$-th feature index from $S_1$. The comparison of the two nested subsets can be recast as a model comparison problem and addressed using the BIC difference as an approximation to the Bayes factor. Following Raftery and Dean (2006), the BIC difference for the inclusion of the $i$-th feature is given by

$$\mathrm{BIC}_{\mathrm{diff}}(Z_{i \in S_1}) = \mathrm{BIC}_{\mathrm{clust}}(Z_{S_1}) - \mathrm{BIC}_{\mathrm{not\,clust}}(Z_{S_1}), \qquad (4)$$

where $\mathrm{BIC}_{\mathrm{clust}}(Z_{S_1})$ is the BIC value for the best model fitted using features in $S_1$, and $\mathrm{BIC}_{\mathrm{not\,clust}}(Z_{S_1})$ is the BIC value for no clustering. The latter can be written as

$$\mathrm{BIC}_{\mathrm{not\,clust}}(Z_{S_1}) = \mathrm{BIC}_{\mathrm{clust}}(Z_{S_2}) + \mathrm{BIC}_{\mathrm{reg}}(Z_i \mid Z_{S_2}),$$

i.e. the BIC value for the best model fitted using features in $S_2$ plus the BIC value for the regression of the $i$-th feature on the remaining $(q - 1)$ features in $S_2$. Noting that GMMDR

variables are orthogonal, the general formula for $\mathrm{BIC}_{\mathrm{reg}}$ (see Raftery and Dean, 2006, eq. (7)) reduces to

$$\mathrm{BIC}_{\mathrm{reg}}(Z_i \mid Z_{S_2}) = -n \log(2\pi) - n \log(\hat{\sigma}_i^2) - n - (q + 1) \log(n),$$

where $\hat{\sigma}_i^2$ is the maximum likelihood estimate of the $i$-th feature variance. In all cases the best clustering model is identified with respect to both the number of mixture components and the model parametrization.

3.2 A greedy search algorithm for selecting the best subset

GMMDR variables are defined as orthogonal linear combinations of the original variables, and it is likely that only a subset of them is needed for clustering. However, the space of all possible subsets of size $q$, with $q$ ranging from 1 to $d$, has $2^d - 1$ elements. An exhaustive search soon becomes infeasible, even for moderate values of $d$. To alleviate this problem, Raftery and Dean (2006) proposed a greedy search algorithm for finding a local optimum in the model space. The search has both a forward step, which evaluates inclusion of a proposed variable, and a backward step, which evaluates exclusion of one of the currently included variables. The algorithm stops when consecutive inclusion and exclusion steps are rejected. Since GMMDR variables are orthogonal, the search can be simplified by skipping the backward step, which considerably reduces the computing time.

The search starts by selecting the single GMMDR variable which maximizes the BIC criterion. This is equivalent to using the statistic in equation (4), with $S_2 = \emptyset$ and $S_1 = \{i\}$ for any $i = 1, \ldots, d$. Thus, at the first step equation (4) measures the difference between the best clustering model and the single cluster model for each variable. In the following steps, we select the GMMDR variate which maximizes the BIC difference (4), thereby taking into account the features already included. We keep adding GMMDR variates until the BIC difference becomes negative. A detailed description of the greedy search algorithm is reported in Appendix A.
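The forward-only search can be sketched as follows. This is a minimal illustration, not the algorithm as reported in the paper's appendix: the callable `bic_clust`, which should return the BIC of the best clustering model for a given tuple of feature indices, is a hypothetical placeholder that the user must supply (for example by fitting Gaussian mixtures), and the toy version used at the end exists only to make the example runnable.

```python
import math

def bic_reg(n, sigma2_hat, q):
    """BIC for the i-th feature under no clustering, for orthogonal features:
    -n log(2 pi) - n log(sigma2_hat) - n - (q + 1) log(n)."""
    return (-n * math.log(2.0 * math.pi) - n * math.log(sigma2_hat)
            - n - (q + 1) * math.log(n))

def greedy_forward_selection(d, n, sigma2_hat, bic_clust):
    """Add, one at a time, the feature with the largest BIC difference;
    stop as soon as the best difference is negative."""
    selected, remaining = [], list(range(d))
    while remaining:
        q = len(selected) + 1
        base = bic_clust(tuple(selected)) if selected else 0.0   # empty set contributes 0
        best_i, best_diff = None, -math.inf
        for i in remaining:
            clust = bic_clust(tuple(selected + [i]))
            not_clust = base + bic_reg(n, sigma2_hat[i], q)
            if clust - not_clust > best_diff:
                best_i, best_diff = i, clust - not_clust
        if best_diff < 0:
            break
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Toy stand-in for the clustering BIC (hypothetical): each feature adds a fixed gain
# over its single-Gaussian BIC; feature 2 carries no clustering information.
n, sigma2_hat, gain = 100, [1.0, 1.0, 1.0], [50.0, 10.0, -20.0]

def toy_bic_clust(subset):
    return sum(bic_reg(n, 1.0, 1) + gain[i] for i in subset)

chosen = greedy_forward_selection(3, n, sigma2_hat, toy_bic_clust)
```

With these made-up gains the search keeps features 0 and 1 and stops before adding feature 2, whose BIC difference is negative.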
At each step, the search over the model space is usually performed with respect to both the model parametrization and the number of clusters. However, we may want to fix the model parameters at the estimated values obtained using the full set of features; in such a case, the order in which the GMMDR directions are selected follows the ordering of the associated eigenvalues.

3.3 An iterative procedure for feature selection

The greedy search discussed in the previous section is likely to discard some of the GMMDR variates, in particular those associated with small eigenvalues which do not carry any clustering information once other features have been included. A GMM may then be fitted on the selected variables and the corresponding GMMDR directions estimated. The feature selection step can then be repeated, and the whole process iterated until no directions can be dropped. Depending on the number of initial features, the number of iterations required is usually small. The whole process is described in the following algorithm:

Algorithm: GMMDR estimation and feature selection process
1. Fit a GMM on the original data.
2. Estimate GMMDR directions using the method in Section 2.2 and project the data onto the estimated subspace.
3. Perform feature selection using the greedy search discussed in Section 3.2.
4. Fit a GMM on the selected GMMDR variables and go to step 2.
5. Repeat steps 2 to 4 until none of the features can be dropped.
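The control flow of the algorithm above can be outlined as a short loop. The sketch is only a skeleton under stated assumptions: the three callables `fit_gmm`, `gmmdr_project` and `select_features` are hypothetical placeholders for the model fitting, projection and greedy selection steps, and the `max_iter` safeguard is an addition that is not part of the original description.

```python
def iterate_gmmdr(data, fit_gmm, gmmdr_project, select_features, max_iter=50):
    """Alternate GMMDR estimation and feature selection until nothing is dropped.

    data            : list of rows (one list of feature values per observation)
    fit_gmm         : callable(data) -> fitted model             (hypothetical)
    gmmdr_project   : callable(model, data) -> projected rows    (hypothetical)
    select_features : callable(rows) -> list of kept column indices (hypothetical)
    """
    model = fit_gmm(data)                                   # step 1
    for _ in range(max_iter):                               # safeguard, an assumption
        Z = gmmdr_project(model, data)                      # step 2
        keep = select_features(Z)                           # step 3
        n_before = len(Z[0])
        data = [[row[j] for j in keep] for row in Z]        # selected GMMDR variables
        model = fit_gmm(data)                               # step 4
        if len(keep) == n_before:                           # step 5: nothing dropped
            break
    return model, data

# Stub components, used only to exercise the loop.
def stub_fit(rows):
    return {"n_features": len(rows[0])}

def stub_project(model, rows):
    return rows

def stub_select(rows):
    k = len(rows[0])
    return list(range(k - 1)) if k > 2 else list(range(k))

final_model, final_data = iterate_gmmdr([[1, 2, 3, 4], [5, 6, 7, 8]],
                                        stub_fit, stub_project, stub_select)
```

With the stubs, one column is dropped per pass until two remain, at which point the selection step keeps everything and the loop stops.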

4 Simulations

In this section we present a data analysis example based on a synthetic data set to describe the proposed approach, followed by some simulation studies. The adjusted Rand index (ARI), proposed by Hubert and Arabie (1985), is adopted for evaluating the clustering obtained and comparing the results arising from competing mixture models. The ARI measures the agreement between two partitions, here one estimated by a statistical procedure and one being the true classification, and it does not depend on the labelling of the groups. This index has zero expected value in the case of a random partition, and it is bounded above by 1. Thus, higher values represent better performance.

4.1 Synthetic data on three overlapping clusters with different shapes

Consider a simulated data set with three overlapping clusters on ten variables, with each cluster sample size equal to $n_g = 50$ ($g = 1, 2, 3$). The first three variables have been generated from a multivariate normal distribution with means $\mu_1 = [0, 0, 0]^\top$, µ_2 = [4,, 6], µ_3 = [, 4, ], and within-group covariance matrices $\Sigma_1$, $\Sigma_2$, and $\Sigma_3$. Seven noise variables, generated independently from a standard normal distribution, were also added. Thus, clustering information is only contained in the first three variables, with the remaining ones being simply noise. The correct model we hope to find is the VVV model with $G = 3$.

The GMM with the largest BIC is the seven-cluster EEI model, closely followed by the model with six clusters. Both models have the wrong number of components, thus yielding poor classification accuracy. Constraining $G = 3$, the selected model improves the ARI, although there are still some misclassified data points (see Table 1). Using the best unconstrained GMM we obtained the GMMDR directions; there are d = 6 of such directions, but only the first three are retained by the feature selection step.
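For reference, the adjusted Rand index used throughout this section can be computed from the contingency table of two partitions. The following is a minimal pure-Python sketch based on the standard pair-counting formula of Hubert and Arabie; the function name is hypothetical, and at least two observations are assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same n >= 2 observations."""
    n = len(labels_a)
    # Pairs of observations placed together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)       # expected index under random labelling
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                   # degenerate case: identical trivial partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # relabelled, identical partition
ari_split = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])    # one cluster split in two
```

The first call returns 1 because the index ignores group labels; the second returns 4/7, since only one of the two true clusters is recovered exactly.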
As can be seen from Table 1, the final VVV mixture model with three clusters fitted on the three selected GMMDR variables is able to recover the original clustering structure.

Table 1: Clustering results for the three clusters simulated data.

Method                  Model   G   BIC   ARI
GMM (unconstrained)     EEI     7
GMM (constrained)       VII     3
GMMDR (3 directions)    VVV     3

The left panel of Figure 3 shows the eigenvalues associated with the GMMDR directions. From such a plot we can see that the first two directions are the most important, with the first showing differences in both means and variances among clusters, while the second shows almost only differences in means. The last direction adds a small amount of clustering information, mostly from differences in variances. The coefficients for the selected GMMDR directions (see the right panel of Figure 3) indicate that only the first three variables seem to be required, since the coefficients for the remaining ones are almost zero.

Graphs contained in Figure 4 use GMMDR directions to convey clustering information about the fitted GMM. Plots along the diagonal show univariate component densities, while plots above the diagonal report contours of the estimated densities for each component. Such

graphs clearly reflect the characteristics of the mixture model fitted; in particular, clusters have different volume, shape and orientation. Furthermore, it is quite clear that clusters on the projection subspace appear to be well separated along the first two directions, as can be seen from the maximum a posteriori (MAP) classification areas below the diagonal of Figure 4.

Figure 3: Plot of eigenvalues for each estimated direction with contributions from means and variances differences among clusters (left panel) and coefficients defining the GMMDR directions (right panel).

4.2 Simulation studies

Here we discuss the results from some simulation studies we conducted to compare the proposed GMMDR approach to model-based clustering performed on both the original variables and the leading principal components. The data sets used for such comparison are generated from some overlapping Gaussian mixtures with different covariance matrix structures. In addition, we investigated the effect of the inclusion of a certain number of redundant and noise variables. The former are variables correlated with those bringing clustering information, so they only provide redundant information. The latter are variables whose distribution does not depend on the cluster label, thus they only introduce some noise which tends to obscure the underlying clustering structure to be recovered. The simulation schemes used for generating the data are described below.

Model 1: three overlapping clusters with common covariances. In the first simulation scheme each data set was generated from three overlapping clusters with equal mixing probabilities on three variables. These were generated from a multivariate normal distribution with

means $\mu_1 = [0, 0, 0]^\top$, µ_2 = [0,, ], µ_3 = [,, ], and common covariance matrix $\Sigma$. To this basic setup, which corresponds to an EEE mixture model (see Fraley and Raftery, 2006), we added two further scenarios. In the first scenario (noise variables), seven noise variables were generated from independent standard normal variables. In the second scenario (redundant and noise variables), we generated three variables correlated with each clustering variable (with correlation coefficients equal to 0.9, 0.7, 0.5, respectively) and four independent standard normal variables. In both cases, there was a total of ten variables.

Figure 4: Scatter plots of GMMDR variables for the three clusters simulated data: bivariate component density contours (above diagonal), marginal univariate component densities (diagonal), and MAP regions (below diagonal).

Model 2: three overlapping clusters with constant shapes. In this simulation scheme we generated observations with equal prior from three classes using the mixture model VEV. The means for each class were $\mu_1 = [0, 0, 0]^\top$, µ_2 = [4,, 6], and µ_3 = [, 4, ], while for the covariance matrices $\Sigma_g = \lambda_g D_g A D_g^\top$ ($g = 1, 2, 3$) the scale, shape and orientation

parameters were, respectively, λ = [0., 0.5, 0.8], A = diag(,, ), and the orthogonal matrices $D_1$, $D_2$, and $D_3$. As for the previous model setup, we also considered two further scenarios with the addition of noise variables in the first case, and noise and redundant variables in the second case.

Model 3: three overlapping clusters with unconstrained covariances. For this last scheme we simulated each data set from three overlapping clusters with equal mixing probabilities on three variables. Clustering variables were generated from a multivariate normal distribution with parameters set as in the synthetic example of Section 4.1, which correspond to a VVV parametrization with $G = 3$. Again, we also considered the effect of adding only noise variables, and noise and redundant variables.

For each data set simulated as described above we fitted a GMM with number of clusters in the range [, 5] and different model parametrizations, then we selected the model with the highest BIC. A GMM was also fitted to the leading principal components computed from the correlation matrix, or equivalently using standardized variables; the number of components was chosen using Kaiser's rule, i.e. selecting those with eigenvalues larger than one (Jolliffe, 2002, p. 4 5). Finally, we applied the GMMDR approach discussed in Section 2.2, followed by the subset selection procedure of Section 3. For each estimated model we computed the adjusted Rand index (Hubert and Arabie, 1985) as a criterion to evaluate the clustering partitions obtained: higher values of ARI correspond to better performance.

The results of the simulation procedure based on 000 repetitions for increasing sample sizes are shown in Table 2, where we report the average adjusted Rand index for the three fitting procedures to be compared and the corresponding standard errors. When no noise variables are included, the GMMDR procedure yields clustering accuracy equivalent to that of GMM on the original variables.
However, when irrelevant features (either noise or redundant ones) are present, the GMMDR procedure significantly improves the accuracy as measured by the adjusted Rand index. This is particularly evident for smaller sample sizes (n = 00, 00) when noise variables are present, and when the underlying clusters have different within-cluster covariance matrices. The PCA+GMM procedure leads to uniformly small values of ARI, confirming that reducing the dimensionality of the problem through principal components does not help to discover clustering structures in the data. As expected, as the sample size increases all three methods improve in accuracy, with the GMM method on the original variables and the GMMDR approach leading to essentially the same accuracy for n = 000.

We also checked the number of clusters selected by each procedure (results not shown here). Recalling that for all the simulation schemes the number of clusters was three, GMMDR almost always selects the true value, whereas GMM tends to select a larger number of components when noise variables are included and the clusters have different covariance structures. PCA+GMM performs worse than the other two methods, often underestimating the true number of clusters in the presence of noise variables and for smaller sample sizes (n = 00, 00).

Based on the simulation results we may conclude that, when there are redundant or noise variables, a suitable subset of GMMDR variables allows us to significantly improve cluster identification. The improvement is larger for small samples having different within-cluster covariances and in the presence of noise variables. Overall, using GMMDR variables does not produce worse results than using the original variables, and often leads to a better selection of the clustering structure. There appears to be no reason to perform model-based clustering on principal components.
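As an illustration of the evaluation criterion used throughout the simulations, the adjusted Rand index can be computed from the contingency table of two partitions. The following is a minimal Python sketch, not the code used for the study; the function name `adjusted_rand_index` and the treatment of degenerate partitions are assumptions introduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two partitions.

    Hypothetical helper for illustration: both arguments are sequences of
    cluster labels, one label per observation, in the same order.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both partitions must label the same observations")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two observations: agreement is trivial (assumed convention).
        return 1.0

    # Contingency counts n_ij and their row and column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Pairs of observations grouped together in both partitions, or in each one.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2         # largest attainable agreement
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index does not depend on how the clusters are labelled, so `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` equals one, while partitions that agree no better than chance give values near zero (negative values are possible).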

Table 2: Average adjusted Rand index (with standard errors in parentheses) based on 000 simulations.

                 No noise                       Noise variables                Noise and redundant variables
                 n = 00    n = 00    n = 000    n = 00    n = 00    n = 000    n = 00    n = 00    n = 000
Model 1 (EEE)
  GMM            (0.040)   (0.066)   (0.008)    (0.6)     (0.7)     (0.0090)   (0.889)   (0.04)    (0.0087)
  PCA+GMM        (0.66)    (0.0446)  (0.0)      (0.700)   (0.56)    (0.098)    (0.4)     (0.0909)  (0.09)
  GMMDR          (0.047)   (0.07)    (0.0088)   (0.46)    (0.04)    (0.0090)   (0.6)     (0.097)   (0.0088)
Model 2 (VEV)
  GMM            (0.00)    (0.04)    (0.007)    (0.085)   (0.079)   (0.008)    (0.09)    (0.57)    (0.0078)
  PCA+GMM        (0.745)   (0.8)     (0.084)    (0.86)    (0.700)   (0.47)     (0.780)   (0.48)    (0.009)
  GMMDR          (0.05)    (0.04)    (0.007)    (0.0799)  (0.065)   (0.008)    (0.08)    (0.069)   (0.0077)
Model 3 (VVV)
  GMM            (0.05)    (0.004)   (0.004)    (0.60)    (0.89)    (0.008)    (0.9)     (0.094)   (0.009)
  PCA+GMM        (0.054)   (0.005)   (0.005)    (0.66)    (0.77)    (0.06)     (0.5)     (0.0984)  (0.06)
  GMMDR          (0.046)   (0.004)   (0.004)    (0.84)    (0.00)    (0.008)    (0.609)   (0.085)   (0.0099)

4.. Unequal prior probabilities

The previous simulation examples assumed equal prior group probabilities. It may be of interest to investigate whether the results obtained depend on this assumption. We expect that, provided there is enough cluster information in the data, no dramatic change from the previous findings should be observed. Thus, we replicated the previous simulation study for the more complex Model 3 (VVV), but with unequal prior probabilities set at $\pi = (0., 0., 0.6)$. The results are shown in Table 3. When no noise variables are included, the accuracy of GMM and GMMDR is essentially the same as in the equal prior probabilities case, whereas PCA+GMM performs slightly worse. If irrelevant variables (either noise or redundant ones) are present, both GMM and GMMDR yield somewhat worse results when n = 00 or 000.
This seems to be due to the fact that BIC often selected models with more components than the true number of clusters. On the contrary, PCA+GMM seems to improve when redundant variables (i.e. variables correlated with the clustering ones) are included. Since principal components are computed from the correlation matrix, we argue that correlated variables are likely to provide useful clustering information when some groups are represented by few observations. Overall, we note that GMMDR never performed worse than GMM, and it was also better than PCA+GMM except when irrelevant variables are included and n = .

Table 3: Average adjusted Rand index (with standard errors in parentheses) based on 000 simulations with unequal prior probabilities.

                 No noise                       Noise variables                Noise and redundant variables
                 n = 00    n = 00    n = 000    n = 00    n = 00    n = 000    n = 00    n = 00    n = 000
Model 3 (VVV)
  GMM            (0.050)   (0.005)   (0.000)    (0.468)   (0.66)    (0.60)     (0.4)     (0.88)    (0.097)
  PCA+GMM        (0.0970)  (0.087)   (0.05)     (0.0748)  (0.050)   (0.07)     (0.0590)  (0.05)    (0.054)
  GMMDR          (0.074)   (0.004)   (0.00)     (0.4)     (0.78)    (0.547)    (0.0)     (0.60)    (0.600)

High-dimensional data

Clustering high-dimensional data is a challenging task. The main problem in fitting a GMM is the large number of parameters which need to be estimated. To overcome this, Ghosh and Chinnaiyan (2002) suggested performing model-based clustering on the first few principal components. Based on the conclusion of Chang (1983), discussed also in Section, this approach does not seem to be advisable. Bouveyron et al (2007) proposed a method for dealing with high-dimensional data.

In Section 4. we described a three-cluster VEV model (Model 2) on 3 clustering variables. We also considered the inclusion of 3 redundant variables (i.e., correlated with the clustering ones) and 4 noise variables; there was a total of ten variables according to the scheme {3 3 4}. This basic setup has been generalized to investigate the behavior of the proposed approach as the number of variables increases. For $k \in \{1, 2, \ldots, 5\}$ we generated $10k$ variables following the scheme {3k 3k 4k}; for instance, if $k = 3$, there are 30 variables: 9 of them are clustering variables, 9 are correlated with the previous ones, and the remaining 12 are noise variables. We only investigated the most difficult case, i.e. the case of a small sample size, by setting n = 00.

As the number of variables increases, the accuracy of all three methods tends to decrease, with GMMDR always more accurate on average than the other two methods (see Table 4). However, for p = 40 and p = 50 the average ARI for GMMDR dropped to just above 0.6. Looking at the results in more detail, we realized that, even though a subset of directions could provide better cluster accuracy, BIC frequently selected too many such directions. To find evidence of this, we replicated the simulation study, but using the simple EII model (common spherical covariance matrix) and selecting the GMMDR directions on the basis of the entropy criterion (Celeux and Soromenho, 1996).
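To make the entropy criterion concrete, the sketch below evaluates it for a matrix of conditional membership probabilities. It assumes the usual form of the criterion, namely minus the sum over units and components of $t_{ig} \log(t_{ig})$, with the convention $0 \log 0 = 0$; the function name `classification_entropy` is hypothetical and this is not the implementation used for the simulations.

```python
from math import log


def classification_entropy(t):
    """Entropy of a soft classification.

    `t` is a list of n rows, each holding the G conditional probabilities
    t_ig that unit i arises from component g (each row sums to one).
    Returns -sum_i sum_g t_ig * log(t_ig), taking 0 * log(0) as 0.
    """
    total = 0.0
    for row in t:
        if abs(sum(row) - 1.0) > 1e-8:
            raise ValueError("each row of t must sum to one")
        for t_ig in row:
            if t_ig > 0.0:  # skip zero probabilities (0 * log 0 = 0)
                total -= t_ig * log(t_ig)
    return total
```

Under this definition a partition with no overlap (every unit assigned to one component with probability one) has entropy zero, and the entropy grows as the membership probabilities become more evenly spread, so smaller values indicate better separated clusters.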
The entropy criterion provides a measure of the overlap of the clusters, and it is defined as $-\sum_{g=1}^{G} \sum_{i=1}^{n} t_{ig} \log(t_{ig})$, where $t_{ig}$ denotes the conditional probability that the $i$-th unit arises from the $g$-th mixture component. Except for the case p = 0, where the ARI is smaller, and p = 0, where the ARI is essentially equivalent, the accuracy of GMMDR is largely increased (see Table 4). Note also that, although we constrained the fit to the wrong EII model, the accuracy of GMM does not worsen. Based on this limited study, it seems that for high-dimensional data GMMDR directions are still able to recover the clustering structure of the data, but the BIC criterion overestimates how many of them are needed. When the entropy criterion is used to select the relevant directions, the clustering accuracy improves.

An extreme form of high-dimensional data is the $p \gg n$ case. Clustering of gene expression data arising from microarray experiments is a typical application, and model-based approaches have been proposed (Yeung et al, 2001). In order to apply the approach discussed in this paper, a form of regularization is required for the inversion of the covariance matrix in (), such as one of those discussed in Bernard-Michel et al (2009). However, these aspects require further study.

Table 4: Average adjusted Rand index (with standard errors in parentheses) based on 500 simulations for increasing dimensionality of the feature space and n = 00. The last two rows show the results from fitting a GMM with a common spherical and equal volume model (EII) and the corresponding GMMDR with relevant directions selected by the entropy criterion.
Model 2 (VEV) p = 0 p = 0 p = 0 p = 40 p = 50 GMM PCA+GMM (0.0850) (0.070) 0.78 (0.0744) (0.076) (0.078) GMMDR (0.8) (0.655) (0.574) (0.58) (0.8) 0.6 (0.084) (0.095) (0.960) (0.09) (0.90) GMM (EII) GMMDR (0.0889) (0.0707) (0.0696) (0.075) (0.0765) 0.88 (0.67) (0.087) (0.080) (0.58) (0.9)

5 Data analysis examples

5.1 Italian wines

Forina et al (1986) reported data on 178 wines grown in the same region in Italy but derived from three different cultivars (Barolo, Grignolino, Barbera). For each wine, measurements of chemical and physical properties were made, such as the level of alcohol, the level of magnesium, the color intensity, etc. The data set is available at the UCI Machine Learning data repository. The following analyses were performed on a standardized scale.

Modelling all the available variables with a GMM selects the model VEI with eight clusters, which gives a small adjusted Rand index (see Table 5). This can be largely improved by the variable selection method proposed by Raftery and Dean (2006). A comparable accuracy is also achieved by McNicholas and Murphy (2008), who applied a class of parsimonious Gaussian mixture models (PGMM) based on mixtures of factor analyzers; their selected model has q = latent factors and it is denoted as CUU (see McNicholas and Murphy, 2008, Table ). We note, however, that despite the improvement in accuracy, PGMM selects the wrong number of clusters (see Table 5).

Table 5: Model-based clustering results for some models fitted to the wine data.

Method               Number of features   Model   G    ARI
GMM                                       VEI     8
GMM (best subset)    5                    VEV          0.78
PGMM                                      CUU
GMMDR                5                    VEV     3    0.85

Best subset: (Malic, Proline, Flavanoids, Intensity, OD280).

We applied the GMMDR approach to this data set and obtained a final VEV model with three clusters estimated on five GMMDR directions.
This model yields an adjusted Rand index equal to 0.85 (see Table 5), and the resulting confusion matrix is reported in Table 6. As Figure 5 shows, the first two directions mainly exhibit differences in cluster means, whereas the remaining directions mostly represent differences in cluster variances. We also note that the first three GMMDR directions appear to contain most of the clustering information. Figure 6 shows a static view of a rotating 3D spin plot for the first three GMMDR directions with data points marked according to cluster membership; Barbera wines (3) appear to be well separated from the others along the first direction, while Barolo (1) and Grignolino (2) wines present a significant separation on the plane identified by the second and third directions.

Table 6: A classification table for the best model found using GMMDR on the wine data.
Cluster Wines Barolo 57 0 Grignolino 64 5 Barbera

Figure 5: Plot of eigenvalues for each GMMDR direction with contributions from means and variances differences among clusters for the wine data. (Axes: Directions, Eigenvalues; legend: Means contrib., Vars contrib.)

5.2 Australian crabs data

This data set contains measurements recorded on 200 specimens of Leptograpsus variegatus crabs on the shore in Western Australia (Campbell and Mahon, 1974). Crabs are classified according to their colour form (blue and orange) and sex. There are 50 specimens of each sex of each species. Each specimen has 5 measurements: frontal lobe size (FL), rear width (RW), carapace length (CL) and width (CW), and body depth (BD). The main interest is to distinguish between species and sex based on the available morphological measurements.

Gaussian mixture modelling performs poorly for this data set. The model with the highest BIC has 9 components, which gives a small adjusted Rand index. If we constrain the number of components to match the actual number of classes, the performance is still poor (see Table 7). Raftery and Dean (2006) obtained a large improvement through variable selection, ending up with a mixture model based on four variables. This data set was also analyzed by Bouveyron et al (2007, pp ), who obtained a 5% error rate with the high-dimensional data clustering (HDDC) model $[a_i b_i Q_i d_i]$. Starting with the best unconstrained mixture model (EEE, G = 9), we estimated the GMMDR directions and then performed feature selection. The final GMMDR model is fitted on three directions with 4 clusters having ellipsoidal shape, equal volume and shape, but different orientation (EEV). The error rate is 7.50%, the same as that obtained by Raftery and Dean (2006).

The resulting accuracy is slightly worse than that of the HDDC model of Bouveyron et al (2007), but the GMMDR approach has the advantage of providing a visualization of the clustering structure.

Figure 6: A static view of a 3D spinning plot of the first three GMMDR directions for the wine data. Data points are marked according to cluster membership; the clusters identified refer mainly to (1) Barolo, (2) Grignolino, and (3) Barbera.

Table 7: Model-based clustering results for some models fitted to the crabs data.

Method                 Model                    G    Error rate (%)   ARI
GMM (unconstrained)    EEE
GMM (constrained)      VEV
GMM (best subset)      EEV
HDDC                   $[a_i b_i Q_i d_i]$
GMMDR                  EEV

Best subset: (CW, RW, FL, BD).

Summary plots (not shown here) indicate that the first direction mainly separates blue from orange crabs, while the second reflects sex differences. The last direction adds only marginal information, contrasting blue and orange crabs within sex. Figure 7 plots the uncertainty $u_i = 1 - \max_g(\hat{z}_{ig})$ projected onto the first two GMMDR directions, where $\hat{z}_{ig}$ is the estimated conditional probability that observation $i$ belongs to group $g$. Misclassified crabs are indicated by a square surrounding the point. As can be seen, orange and blue crabs appear well separated, whereas discrimination of sex within crab species is more problematic: some points lie in the regions of higher uncertainty (darker areas), in particular for blue crabs. It is interesting to note that these aspects can also be read off from the corresponding confusion matrix (see Table 8).
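The classification uncertainty plotted in Figure 7 can be sketched in a few lines of Python. The example assumes that the uncertainty of an observation is one minus its largest estimated conditional probability, with the maximum a posteriori (MAP) cluster being the one that attains it; the function name `map_classification` and the toy probabilities are hypothetical and not taken from the crabs analysis.

```python
def map_classification(z):
    """MAP labels and uncertainties from posterior probabilities.

    `z` is a list of n rows; row i holds the estimated conditional
    probabilities z_ig that observation i belongs to each of the G groups.
    Returns (labels, uncertainty) with uncertainty_i = 1 - max_g z_ig.
    """
    labels = []
    uncertainty = []
    for row in z:
        # Index of the most probable group (first one in case of ties).
        g_best = max(range(len(row)), key=row.__getitem__)
        labels.append(g_best)
        uncertainty.append(1.0 - row[g_best])
    return labels, uncertainty


# Toy example with three observations and two groups (made-up values).
z_hat = [[0.95, 0.05], [0.40, 0.60], [0.50, 0.50]]
labels, u = map_classification(z_hat)
```

Observations whose largest probability is close to one get an uncertainty near zero, while an observation split evenly between $G$ groups reaches the upper bound $1 - 1/G$; these are the points that fall in the darker regions of an uncertainty plot.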

Table 8: A classification table for the final GMMDR model fitted on the crabs data. (Rows: species BF, BM, OF, OM; columns: clusters.)

6 Conclusion

In this paper we addressed the problem of dimension reduction for model-based clustering from the point of view of visualizing the clustering structure. Often, principal component analysis is used as a pre-processing step, and a clustering procedure is applied to the leading principal components. Probabilistic principal components and factor analyzers have also been used for dimension reduction. However, these do not necessarily capture the underlying clustering structure.

Unlike other approaches (Tipping and Bishop, 1999a,b; McLachlan et al, 2003; Bouveyron et al, 2007; McNicholas and Murphy, 2008), the proposed method is not connected to any underlying latent variable structure, nor is an EM-like algorithm employed in the estimation. Instead, we aim at identifying the subspace which captures most of the clustering information contained in the data. Estimation of the directions spanning such a subspace is based on the eigendecomposition of a suitable kernel matrix, with unknown parameters estimated from a GMM selected as the best description of the data at hand. Visualization techniques can then be applied using the leading directions in order to obtain summary plots. The estimated directions can be further reduced by recasting the problem of feature selection as a model choice problem and using BIC to approximate Bayes factors. A forward greedy search algorithm is discussed to avoid the search over all possible subsets.

The proposed methodology has been applied to simulated and real data sets. In all cases we were able to identify a dimension-reduced subspace while preserving the cluster information contained in the original variables. Moreover, simulation studies showed that a suitable subset of directions allows for a significant improvement in cluster identification.
The improvement is larger when within-cluster covariances are different and redundant or noise variables are present. Finally, it is straightforward to extend the proposed approach to a supervised context, where the class membership of each unit is known in advance.

Appendix A. Proofs

Proof of proposition

Define the transformed data matrix as $\tilde{X} = XC + \mathbf{1}a^\top$, where $C$ is a $(p \times p)$ nonsingular matrix, $a$ is a $p$-vector of real values, and $\mathbf{1}$ is an $n$-vector of ones. It is easy to show that in the transformed scale $\tilde{\Sigma} = C^\top \Sigma C$ and the kernel matrix is given by $\tilde{M} = C^\top M C$. The generalized eigendecomposition in the transformed scale, $\tilde{M}\tilde{V} = \tilde{\Sigma}\tilde{V}\tilde{L}$, can be written as $C^\top M C \tilde{V} = C^\top \Sigma C \tilde{V} \tilde{L}$; since $C$ is nonsingular, this satisfies the equation $M C \tilde{V} = \Sigma C \tilde{V} \tilde{L}$. Hence $C\tilde{V}$ must be equal to the matrix $V$ of eigenvectors of the generalized eigendecomposition of $M$ with respect to $\Sigma$, and $\tilde{L}$ is equal to the corresponding diagonal matrix of eigenvalues. Therefore, $\tilde{V} = C^{-1}V$, $\tilde{L} = L$, and the GMMDR directions in the transformed scale are given by $\tilde{\beta} = C^{-1}\beta$. Furthermore, the corresponding variates are $\tilde{Z} = \tilde{X}\tilde{\beta} = X\beta + \mathbf{1}a^\top C^{-1}\beta$, so they are simply a shifted version of the GMMDR variates in the original scale.


More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu November 3, 2015 Methods to Learn Matrix Data Text Data Set Data Sequence Data Time Series Graph

More information

Extending mixtures of multivariate t-factor analyzers

Extending mixtures of multivariate t-factor analyzers Stat Comput (011) 1:361 373 DOI 10.1007/s11-010-9175- Extending mixtures of multivariate t-factor analyzers Jeffrey L. Andrews Paul D. McNicholas Received: July 009 / Accepted: 1 March 010 / Published

More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

STATISTICAL LEARNING SYSTEMS

STATISTICAL LEARNING SYSTEMS STATISTICAL LEARNING SYSTEMS LECTURE 8: UNSUPERVISED LEARNING: FINDING STRUCTURE IN DATA Institute of Computer Science, Polish Academy of Sciences Ph. D. Program 2013/2014 Principal Component Analysis

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Dimensionality Reduction and Principle Components

Dimensionality Reduction and Principle Components Dimensionality Reduction and Principle Components Ken Kreutz-Delgado (Nuno Vasconcelos) UCSD ECE Department Winter 2012 Motivation Recall, in Bayesian decision theory we have: World: States Y in {1,...,

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Latent Variable Models and EM Algorithm

Latent Variable Models and EM Algorithm SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always

More information

MACHINE LEARNING ADVANCED MACHINE LEARNING

MACHINE LEARNING ADVANCED MACHINE LEARNING MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions 2 2 MACHINE LEARNING Overview Definition pdf Definition joint, condition, marginal,

More information

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017

COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Chemometrics: Classification of spectra

Chemometrics: Classification of spectra Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture

More information

A pairwise likelihood approach to simultaneous clustering and dimensional reduction of ordinal data

A pairwise likelihood approach to simultaneous clustering and dimensional reduction of ordinal data arxiv:1504.02913v1 [stat.me] 11 Apr 2015 A pairwise likelihood approach to simultaneous clustering and dimensional reduction of ordinal data Monia Ranalli Roberto Rocci Abstract The literature on clustering

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Model-based clustering of high-dimensional data: an overview and some recent advances

Model-based clustering of high-dimensional data: an overview and some recent advances Model-based clustering of high-dimensional data: an overview and some recent advances Charles BOUVEYRON Laboratoire SAMM, EA 4543 Université Paris 1 Panthéon-Sorbonne This presentation is based on several

More information

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation)

Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations

More information

Model selection criteria in Classification contexts. Gilles Celeux INRIA Futurs (orsay)

Model selection criteria in Classification contexts. Gilles Celeux INRIA Futurs (orsay) Model selection criteria in Classification contexts Gilles Celeux INRIA Futurs (orsay) Cluster analysis Exploratory data analysis tools which aim is to find clusters in a large set of data (many observations

More information

ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX

ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX STATISTICS IN MEDICINE Statist. Med. 17, 2685 2695 (1998) ON THE CALCULATION OF A ROBUST S-ESTIMATOR OF A COVARIANCE MATRIX N. A. CAMPBELL *, H. P. LOPUHAA AND P. J. ROUSSEEUW CSIRO Mathematical and Information

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

Simulation study on using moment functions for sufficient dimension reduction

Simulation study on using moment functions for sufficient dimension reduction Michigan Technological University Digital Commons @ Michigan Tech Dissertations, Master's Theses and Master's Reports - Open Dissertations, Master's Theses and Master's Reports 2012 Simulation study on

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

INRIA Rh^one-Alpes. Abstract. Friedman (1989) has proposed a regularization technique (RDA) of discriminant analysis

INRIA Rh^one-Alpes. Abstract. Friedman (1989) has proposed a regularization technique (RDA) of discriminant analysis Regularized Gaussian Discriminant Analysis through Eigenvalue Decomposition Halima Bensmail Universite Paris 6 Gilles Celeux INRIA Rh^one-Alpes Abstract Friedman (1989) has proposed a regularization technique

More information

Generative modeling of data. Instructor: Taylor Berg-Kirkpatrick Slides: Sanjoy Dasgupta

Generative modeling of data. Instructor: Taylor Berg-Kirkpatrick Slides: Sanjoy Dasgupta Generative modeling of data Instructor: Taylor BergKirkpatrick Slides: Sanjoy Dasgupta Parametric versus nonparametric classifiers Nearest neighbor classification: size of classifier size of data set Nonparametric:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

PCA and LDA. Man-Wai MAK

PCA and LDA. Man-Wai MAK PCA and LDA Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/ mwmak References: S.J.D. Prince,Computer

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

applications Rome, 9 February Università di Roma La Sapienza Robust model based clustering: methods and applications Francesco Dotto Introduction

applications Rome, 9 February Università di Roma La Sapienza Robust model based clustering: methods and applications Francesco Dotto Introduction model : fuzzy model : Università di Roma La Sapienza Rome, 9 February Outline of the presentation model : fuzzy 1 General motivation 2 algorithm on trimming and reweigthing. 3 algorithm on trimming and

More information

Principal Components Analysis. Sargur Srihari University at Buffalo

Principal Components Analysis. Sargur Srihari University at Buffalo Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Machine Learning - MT & 14. PCA and MDS

Machine Learning - MT & 14. PCA and MDS Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)

More information

Unsupervised Learning: K- Means & PCA

Unsupervised Learning: K- Means & PCA Unsupervised Learning: K- Means & PCA Unsupervised Learning Supervised learning used labeled data pairs (x, y) to learn a func>on f : X Y But, what if we don t have labels? No labels = unsupervised learning

More information

LECTURE NOTE #10 PROF. ALAN YUILLE

LECTURE NOTE #10 PROF. ALAN YUILLE LECTURE NOTE #10 PROF. ALAN YUILLE 1. Principle Component Analysis (PCA) One way to deal with the curse of dimensionality is to project data down onto a space of low dimensions, see figure (1). Figure

More information