Dimension Reduction and Clustering of High Dimensional Data using a Mixture of Generalized Hyperbolic Distributions

DIMENSION REDUCTION AND CLUSTERING OF HIGH DIMENSIONAL DATA USING A MIXTURE OF GENERALIZED HYPERBOLIC DISTRIBUTIONS

BY THINESH PATHMANATHAN, B.Sc.

A thesis submitted to the Department of Mathematics & Statistics and the School of Graduate Studies of McMaster University in partial fulfilment of the requirements for the degree of Master of Science.

© Copyright by Thinesh Pathmanathan, March 2018. All Rights Reserved.

Master of Science (2018), Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada

TITLE: Dimension Reduction and Clustering of High Dimensional Data using a Mixture of Generalized Hyperbolic Distributions
AUTHOR: Thinesh Pathmanathan, B.Sc. (Mathematics and Statistics), University of Toronto, Toronto, Canada
SUPERVISOR: Dr. Sharon McNicholas
NUMBER OF PAGES: viii, 35

To my family and friends for their endless support throughout this endeavor.

Abstract

Model-based clustering is a probabilistic approach that views each cluster as a component in an appropriate mixture model. The Gaussian mixture model is one of the most widely used model-based methods. However, this model tends to perform poorly when clustering high-dimensional data due to the over-parametrized solutions that arise in high-dimensional spaces. This work instead considers the approach of combining dimension reduction techniques with clustering via a mixture of generalized hyperbolic distributions. The dimension reduction techniques principal component analysis and factor analysis, along with their extensions, were reviewed. The aforementioned dimension reduction techniques were then individually paired with the mixture of generalized hyperbolic distributions in order to demonstrate the clustering performance achieved under each method using both simulated and real data sets. For a majority of the data sets, the clustering method utilizing principal component analysis exhibited better classification results than the clustering method based on extending the factor analysis model.

Acknowledgements

First and foremost, I have to thank my research supervisor, Dr. Sharon McNicholas; without her assistance and dedicated involvement in every step throughout the process, this paper would never have been accomplished. I would like to thank you very much for your support and understanding over the past year. I would also like to thank Dr. Roman Viveros-Aguilera and Dr. Paul McNicholas for taking the time to serve as members of my examining committee. Additionally, I would like to thank Dr. Paul McNicholas for providing most of the R code that I used for this analysis.

Contents

Abstract
Acknowledgements
1 Introduction
2 Content
  2.1 Mixture Models
  2.2 Mixtures of Multivariate Gaussian distributions
  2.3 Generalized Inverse Gaussian Distributions
  2.4 Mixtures of Generalized Hyperbolic Distributions
  2.5 Principal Component Analysis
  2.6 Mixtures of Generalized Hyperbolic Factor Analyzers
    Factor Analysis
    Mixtures of Factor Analyzers

3 Methodology
  3.1 Likelihood
  3.2 Parameter Estimation
    EM Algorithm
    AECM Algorithm
    Method Initialization
    Convergence
  3.3 Predicted Classifications
  3.4 Model Selection
  3.5 Performance Assessment
4 Application
  4.1 Simulated Studies
    Twonorm Data
    Ringnorm Data
    Waveform Data
  4.2 Real Data
    Wine Data
    Satellite Data
5 Conclusions
A Bessel Functions
Bibliography

List of Figures

2.1 Densities of GIG distributions with various parameterizations
4.2 Projection of the clustered twonorm data onto the first two principal components
4.3 Projection of the clustered ringnorm data onto the first two principal components
4.4 Projection of the clustered waveform data onto the first two principal components

Chapter 1

Introduction

Classification is a procedure in which group membership labels are assigned to unlabeled observations. A group can be a class or a cluster. Having prior knowledge of observation labels, and the degree to which this information is used, leads to the following three types of classification: supervised, semi-supervised, and unsupervised (McNicholas, 2016). Cluster analysis is unsupervised classification, where the labels for all observations are missing or are treated as such. A typical way to define a cluster is as a group of observations such that the observations in a particular group are more similar to one another than they are to the observations in any alternative group. Such a definition is problematic because, taken to the extreme, each observation is defined as its own cluster (McNicholas, 2016). Instead, it is more prudent to think of a cluster as a component in a suitable mixture model (Tiedeman, 1955; Wolfe, 1963). The use of a mixture model or a family of mixture models for clustering is known as model-based clustering (McNicholas, 2016). The onset of the information age has resulted in large improvements to the amount of

data that can be stored and a rapid increase in the rate at which these data can be generated. Both of these developments have resulted in the term Big Data being coined. Big Data sources are typically associated with high-dimensional variables of mixed type which are produced in rapid succession (Daas et al., 2015). The focus herein will be on model-based clustering of high-dimensional data. Clustering high-dimensional data using traditional model-based approaches has proven to be difficult due to a phenomenon known in the literature as the curse of dimensionality (Bellman, 1957). This curse refers to the over-parametrized solutions that are returned from estimating parameters over a high-dimensional space (Bouveyron and Brunet-Saumard, 2014). The works of Scott and Thompson (1983) and Huber (1985) demonstrated that it is more pragmatic to implement clustering techniques in a lower dimensional space as opposed to the original space. Consequently, an appropriate way to address the issue of dimensionality is to consider dimension reduction techniques. This thesis is a continuation of the earlier work begun by Bouveyron and Brunet-Saumard (2014), who used Gaussian mixture approaches to cluster high-dimensional data. This work further develops their contributions by pairing dimension reduction techniques with clustering using the mixture of generalized hyperbolic distributions (MGHD). In particular, the approach of using principal component analysis for dimension reduction followed by clustering using the MGHD will be compared against the mixture of generalized hyperbolic factor analyzers (MGHFA) model. The mixture of factor analyzers (MFA) model simultaneously clusters the data and reduces the dimensionality locally within each cluster. The MGHFA model is an extension of the MFA model which incorporates the generalized hyperbolic distribution. Model-based clustering will be carried out using the MGHD model in favor of

the conventional Gaussian mixture model because it is able to produce more favorable results when the clusters are asymmetric or skewed. Furthermore, the MGHD exhibits a great deal of modelling flexibility as a result of its index parameter (see Chapter 2 for details). The remainder of this thesis will be organized as follows: in Chapter 2, the theory behind mixture models, relevant distributions, and popular dimension reduction techniques will be discussed. In Chapter 3, parameter estimation, model selection, and classification assessment will be overviewed. In Chapter 4, the R software (R Core Team, 2017) will be used to demonstrate the classification performance achieved when using dimension reduction approaches in conjunction with model-based clustering methods. Lastly, Chapter 5 will end with a general discussion of the highlighted methods and suggestions for future work.

Chapter 2

Content

2.1 Mixture Models

A finite mixture model is a convex linear combination of two or more probability density functions. Let X be a p-dimensional random vector. The probability density function of a mixture model is given by

f(x | ϑ) = \sum_{g=1}^{G} π_g f_g(x | θ_g),   (2.1)

where f_g(x | θ_g) is the density function for the gth component, the π_g are the mixing proportions, and ϑ = (π_1, π_2, ..., π_G, θ_1, θ_2, ..., θ_G) is the vector of parameters. The following assumptions are made: π_g > 0 for g ∈ {1, 2, ..., G}, and \sum_{g=1}^{G} π_g = 1. By convention, the component densities f_1(x | θ_1), f_2(x | θ_2), ..., f_G(x | θ_G) come from the same family of distributions, and f(x | ϑ) is known as the G-component finite mixture density (McNicholas, 2016).
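To make (2.1) concrete, the minimal R sketch below evaluates a two-component mixture density at a single point; the univariate Gaussian components and all parameter values are illustrative assumptions, not quantities taken from the thesis.

```r
# Minimal sketch of the finite mixture density in (2.1), assuming univariate
# Gaussian components; mixing proportions and parameters are illustrative only.
mixture_density <- function(x, pi_g, mu, sigma) {
  # x: scalar point; pi_g: mixing proportions summing to 1
  # mu, sigma: vectors of component means and standard deviations
  sum(pi_g * dnorm(x, mean = mu, sd = sigma))
}

mixture_density(0.5, pi_g = c(0.3, 0.7), mu = c(-1, 2), sigma = c(1, 0.5))
```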

2.2 Mixtures of Multivariate Gaussian distributions

Gaussian mixture models (GMMs) are widely used in clustering applications because data following a mixture of multivariate normal densities form elliptical clusters centered around their means, with higher density for points closer to the mean (Fraley and Raftery, 2002). The density function for a GMM can be written as

f(x | ϑ) = \sum_{g=1}^{G} π_g φ(x | µ_g, Σ_g),   (2.2)

where

φ(x | µ_g, Σ_g) = \frac{1}{\sqrt{(2π)^p |Σ_g|}} \exp\left\{ -\frac{1}{2} (x − µ_g)' Σ_g^{-1} (x − µ_g) \right\}   (2.3)

is the probability density function of a multivariate Gaussian distribution with mean µ_g and variance-covariance matrix Σ_g. Parameter estimation for the GMM is carried out using the expectation-maximization algorithm. GMMs perform poorly when the data exhibit departures from the assumptions of normality such as skewness or outliers (Juárez and Steel, 2010).
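As a concrete illustration of fitting a GMM by the EM algorithm, the sketch below uses the mclust package; mclust is not used in the thesis, and the simulated data and range of G are purely illustrative.

```r
# A hedged sketch of fitting a Gaussian mixture model by EM with mclust;
# the data and settings below are illustrative only.
library(mclust)

set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 3), ncol = 2))

fit <- Mclust(x, G = 1:5)   # EM over G = 1,...,5 components; model chosen by BIC
summary(fit)                # selected model, mixing proportions, means
head(fit$z)                 # a posteriori component probabilities
```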

2.3 Generalized Inverse Gaussian Distributions

The generalized inverse Gaussian (GIG) distribution was first formalized by Good (1953) but its development can be traced back to Halphen (1941). At the time of its development, the GIG was known as the Halphen Type A distribution. Barndorff-Nielsen and Halgreen (1977) outlined the statistical properties that resulted in the GIG distribution that is used today. Consider a random variable W following a GIG distribution with parameters a, b ∈ R^+ and λ ∈ R; then its probability density function, for w > 0, can be written as

q(w | a, b, λ) = \frac{(a/b)^{λ/2} w^{λ−1}}{2 K_λ(\sqrt{ab})} \exp\left\{ −\frac{aw + b/w}{2} \right\},   (2.4)

where K_λ is the modified Bessel function of the third kind (see Appendix A for details) with index parameter λ, and the parameters a and b control concentration via \sqrt{ab} and scaling via \sqrt{b/a} (Koudou et al., 2014; Browne and McNicholas, 2015).

Figure 2.1: Densities of GIG distributions with various parameterizations.

Setting ψ = \sqrt{ab} and η = \sqrt{b/a} results in a more interpretable parameterization that expresses the concentration and scale parameters explicitly (Koudou et al., 2014; Browne and McNicholas, 2015). This parameterization is represented by

h(w | ψ, η, λ) = \frac{(w/η)^{λ−1}}{2 η K_λ(ψ)} \exp\left\{ −\frac{ψ}{2} \left( \frac{w}{η} + \frac{η}{w} \right) \right\}.   (2.5)

The GIG is a highly versatile distribution with special cases consisting of widely used distributions (Jorgensen, 2012; Browne and McNicholas, 2015), e.g., the gamma distribution (b → 0, λ > 0) and the inverse Gaussian distribution (λ = −1/2).
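The density in (2.4) is straightforward to evaluate with base R's besselK(); the sketch below does so for illustrative parameter values and numerically checks that the density integrates to one.

```r
# Sketch of the GIG density in (2.4) using base R's modified Bessel function
# besselK(); the parameter values below are illustrative assumptions.
dgig <- function(w, a, b, lambda) {
  ifelse(w > 0,
         (a / b)^(lambda / 2) * w^(lambda - 1) /
           (2 * besselK(sqrt(a * b), nu = lambda)) *
           exp(-(a * w + b / w) / 2),
         0)
}

w <- seq(0.01, 5, by = 0.01)
plot(w, dgig(w, a = 2, b = 1, lambda = 1), type = "l",
     ylab = "density", main = "GIG density (illustrative parameters)")

# Sanity check: the density should integrate to (approximately) 1
integrate(dgig, 0, Inf, a = 2, b = 1, lambda = 1)
```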

2.4 Mixtures of Generalized Hyperbolic Distributions

The generalized hyperbolic (GH) distribution was introduced by Barndorff-Nielsen (1977) with the intention of modeling aeolian sand deposits and dune movements. McNeil et al. (2015) write the probability density of the GH distribution as

f(x | ϑ) = \left[ \frac{χ + δ(x, µ | ∆)}{ψ + γ'∆^{−1}γ} \right]^{(λ − p/2)/2} \frac{(ψ/χ)^{λ/2} K_{λ−p/2}\left( \sqrt{[ψ + γ'∆^{−1}γ][χ + δ(x, µ | ∆)]} \right)}{(2π)^{p/2} |∆|^{1/2} K_λ(\sqrt{χψ}) \exp\{(µ − x)'∆^{−1}γ\}},   (2.6)

where ϑ = (µ, γ, ∆, λ, χ, ψ) is the set of parameters and K_λ is the modified Bessel function of the third kind with index parameter λ. The location and skewness parameters are given by µ and γ, respectively, while the concentration parameters are given by χ and ψ. The squared Mahalanobis distance between x and µ is denoted by δ(x, µ | ∆) = (x − µ)'∆^{−1}(x − µ), where ∆ is the positive definite scale matrix with

the constraint |∆| = 1. The GH distribution can be defined in terms of a normal variance-mean mixture with a GIG mixing distribution using the relation

X = µ + W γ + \sqrt{W} V,   (2.7)

where W ∼ GIG(ψ, χ, λ) is a GIG random variable and V ∼ N(0, ∆) is a latent multivariate Gaussian random variable (Hammerstein, 2010; McNicholas, 2016). Analogous to the GIG distribution, several distributions can be recovered as special or limiting cases of the GH distribution. Some well known examples include the skew-t (Murray et al., 2014), the variance-gamma (McNicholas et al., 2017), and the normal-inverse Gaussian (Karlis and Santourian, 2009). Browne and McNicholas (2015) as well as Hu (2005) noted that using the GH distribution for the purposes of model-based clustering requires relaxing the constraint |∆| = 1. However, relaxing this assumption results in the model being unidentifiable; that is, two or more parameterizations correspond to the same probability distribution, making inference about the true parameters difficult. Browne and McNicholas (2015) introduced the parameterization ω = \sqrt{ψχ} and η = \sqrt{χ/ψ}, which leads to another parameterization of the GH distribution:

f(x | ϑ) = \left[ \frac{ω + δ(x, µ | Σ)}{ω + α'Σ^{−1}α} \right]^{(λ − p/2)/2} \frac{K_{λ−p/2}\left( \sqrt{[ω + α'Σ^{−1}α][ω + δ(x, µ | Σ)]} \right)}{(2π)^{p/2} |Σ|^{1/2} K_λ(ω) \exp\{(µ − x)'Σ^{−1}α\}},   (2.8)

where Σ is the scale matrix and α is the skewness parameter. Browne and McNicholas (2015) use the parameterization in (2.8) for the mixture of generalized hyperbolic distributions (MGHD). Parameter estimation for the MGHD model is carried out using the expectation-maximization algorithm.
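The stochastic representation (2.7) also gives a simple way to simulate from a GH distribution. The sketch below assumes the GIGrvg package for GIG random variates and MASS for multivariate normals; the parameter values are illustrative, and this is not a sampling scheme used in the thesis.

```r
# Sketch of generating GH draws via the normal variance-mean mixture (2.7):
# X = mu + W*gamma + sqrt(W)*V, with W ~ GIG and V ~ N(0, Delta).
# Assumes the GIGrvg and MASS packages; all parameter values are illustrative.
library(GIGrvg)   # rgig()
library(MASS)     # mvrnorm()

set.seed(1)
n     <- 1000
mu    <- c(0, 0)                       # location
gam   <- c(1, -0.5)                    # skewness
Delta <- matrix(c(1, 0.3, 0.3, 1), 2)  # scale matrix

W <- rgig(n, lambda = 1, chi = 1, psi = 1)       # GIG mixing variable
V <- mvrnorm(n, mu = c(0, 0), Sigma = Delta)     # latent Gaussian
X <- sweep(W %o% gam, 2, mu, "+") + sqrt(W) * V  # one GH draw per row

plot(X, pch = 20, main = "Simulated generalized hyperbolic sample")
```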

2.5 Principal Component Analysis

Principal component analysis (PCA) was first proposed by Pearson (1901) and was further developed by Hotelling (1933). Hotelling described it as a method aimed at reducing the dimensionality of the data while retaining as much of the variability as possible. The classical view of principal components is as orthogonal linear transformations of the original data (Bouveyron and Brunet-Saumard, 2014). Consider a random vector X = (X_1, X_2, ..., X_p)' with covariance matrix Σ. Let λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0 be the ordered eigenvalues of Σ, and let e_1, e_2, ..., e_p be the corresponding eigenvectors. Then the ith principal component can be written as

Y_i = e_{i1} X_1 + e_{i2} X_2 + ... + e_{ip} X_p,   (2.9)

for i = 1, 2, ..., p, where the ith principal component is the linear combination of the X_j's with maximum variance, conditional on it being orthogonal to the first i − 1 components (Johnson and Wichern, 2007). The variance of the kth principal component is given by λ_k. Consequently, the proportion of total variation explained by the first k principal components is

\frac{\sum_{i=1}^{k} λ_i}{\sum_{i=1}^{p} λ_i}.   (2.10)

An important consideration when using PCA for statistical analysis is determining how many principal components to retain in the model. This is typically addressed by considering the amount of total sample variance explained (Johnson and Wichern, 2007).
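The sketch below illustrates (2.9)-(2.10) and the retention rule just described, using the eigendecomposition of a sample covariance matrix; the data set and the 85 percent threshold (the one used in Chapter 4) are illustrative choices.

```r
# Sketch of PCA via the spectral decomposition in (2.9)-(2.10), applied to an
# illustrative data set with an 85% variance-explained retention rule.
X   <- scale(iris[, 1:4])          # any numeric data; iris used for illustration
eig <- eigen(cov(X))

prop_var <- cumsum(eig$values) / sum(eig$values)
k <- which(prop_var >= 0.85)[1]    # smallest k explaining at least 85% of variance

scores <- X %*% eig$vectors[, 1:k] # projection onto the first k principal components
round(prop_var, 3)
k
```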

2.6 Mixtures of Generalized Hyperbolic Factor Analyzers

Factor Analysis

Factor analysis is another popular dimension reduction technique; it was introduced by Spearman (1904) and its statistical properties were outlined by Bartlett (1953) and Lawley and Maxwell (1962). Factor analysis sets out to reduce the dimensionality of the p observed variables by substituting them with q latent factors, where q < p. An important consideration is that the q latent factors must account for a sufficient amount of the variability originally explained by the p observed variables (McNicholas, 2016). Consider p-dimensional random variables X_1, X_2, ..., X_n. Assume there are q-dimensional latent factors U_1, U_2, ..., U_n; then the factor analysis model is given by

X_i = µ + Λ U_i + ɛ_i,   (2.11)

for i = 1, 2, ..., n, where Λ is a p × q matrix of factor loadings. The latent factors are distributed as U_i ∼ N(0, I_q), with error terms ɛ_i ∼ N(0, Ψ), where Ψ is a p × p diagonal positive definite matrix. Furthermore, it is assumed that the U_i and the ɛ_i are each independently distributed and that they are independent of each other. Hence, the marginal distribution of X_i under the factor analysis model is N(µ, ΛΛ' + Ψ).

Mixtures of Factor Analyzers

The factor analysis approach can be extended to the mixture of factor analyzers model, which assumes that

X_i = µ_g + Λ_g U_{ig} + ɛ_{ig},   (2.12)

with probability π_g, for g = 1, ..., G (Ghahramani and Hinton, 1997; McLachlan and Peel, 2000). This approach was further extended to develop the mixture of generalized hyperbolic factor analyzers (MGHFA) model (see Tortora et al. (2015) for details). Its density can be written as

f(x | ϑ) = \sum_{g=1}^{G} π_g f_h(x | θ_g),   (2.13)

where f_h(x | θ_g) is as defined in equation (2.8) and θ_g = (µ_g, Σ_g, α_g, ω_g, λ_g), with the scale matrix for component g defined by Σ_g = Λ_g Λ_g' + Ψ_g. The parameters of the MGHFA model are estimated using the alternating expectation-conditional maximization algorithm.
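For the plain factor analysis model (2.11) underlying the mixtures just described, maximum likelihood estimates of Λ and Ψ can be obtained with base R's factanal(); the data set and the choice of q = 2 factors below are illustrative, and the last line checks how well ΛΛ' + Ψ reproduces the sample correlations.

```r
# Sketch of the factor analysis model (2.11) fitted by maximum likelihood with
# base R's factanal(); data and the choice of q = 2 are illustrative.
X  <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
fa <- factanal(X, factors = 2, rotation = "none")

Lambda <- unclass(fa$loadings)          # p x q matrix of factor loadings
Psi    <- diag(fa$uniquenesses)         # diagonal error covariance

# Implied covariance of X under the model: Lambda Lambda' + Psi
Sigma_hat <- Lambda %*% t(Lambda) + Psi
round(Sigma_hat - cor(X), 2)            # residual fit of the two-factor model
```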

Chapter 3

Methodology

3.1 Likelihood

The likelihood function serves as a basis for making inferences about the unknown model parameters through maximum likelihood estimators. In general, the log-likelihood function is used to find the maximum likelihood estimates because it is easier to differentiate than the original likelihood function. Since log(x) is a monotonically increasing function, maximizing the log-likelihood is equivalent to maximizing the likelihood function. Consider n p-dimensional unlabelled observations (x_1, x_2, ..., x_n) that we wish to separate into G similar groups. Let (z_1, z_2, ..., z_n) represent the unobserved labels for each observation, where z_i = (z_{i1}, z_{i2}, ..., z_{iG}). Then the model-based clustering log-likelihood for the set of observations (x_1, x_2, ..., x_n) is given by

L(ϑ | x) = \sum_{i=1}^{n} \log\left( \sum_{g=1}^{G} π_g f(x_i | θ_g) \right).   (3.1)
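The observed-data log-likelihood (3.1) can be written directly as a function of the data and the component parameters; the sketch below does this for Gaussian components, using the mvtnorm package purely for illustration (the thesis itself works with generalized hyperbolic components).

```r
# Sketch of the observed-data log-likelihood (3.1) for a Gaussian mixture,
# assuming the mvtnorm package; the example data and parameters are illustrative.
library(mvtnorm)

mixture_loglik <- function(X, pi_g, mu_list, Sigma_list) {
  # X: n x p matrix; pi_g: mixing proportions; mu_list/Sigma_list: per-component parameters
  G <- length(pi_g)
  dens <- sapply(seq_len(G), function(g)
    pi_g[g] * dmvnorm(X, mean = mu_list[[g]], sigma = Sigma_list[[g]]))
  sum(log(rowSums(dens)))
}

set.seed(1)
X <- rbind(rmvnorm(50, mean = c(0, 0)), rmvnorm(50, mean = c(3, 3)))
mixture_loglik(X, pi_g = c(0.5, 0.5),
               mu_list = list(c(0, 0), c(3, 3)),
               Sigma_list = list(diag(2), diag(2)))
```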

In this instance the goal of clustering is to determine the value of the missing label z_i for each observation x_i. This makes maximizing the log-likelihood difficult, since information about the group labels (z_1, z_2, ..., z_n) is missing. Together the sets (x_1, x_2, ..., x_n) and (z_1, z_2, ..., z_n) form the complete-data set (Bouveyron and Brunet-Saumard, 2014). The model-based clustering complete-data log-likelihood is given by

l_c(ϑ; x, z) = \sum_{i=1}^{n} \sum_{g=1}^{G} z_{ig} \log\left( π_g f(x_i | θ_g) \right),   (3.2)

where z_{ig} = 1 if the ith observation belongs to the gth cluster and z_{ig} = 0 otherwise. The parameters in the proposed model can be estimated using the expectation-maximization (EM) algorithm (Dempster et al., 1977). The EM algorithm is one of the most popular methods for parameter estimation in the presence of missing data or unobserved variables.

3.2 Parameter Estimation

EM Algorithm

The EM algorithm is an iterative process that alternates between the E-step and the M-step until convergence. The E-step consists of determining the expected value of the complete-data log-likelihood given the observed data and the current parameter estimates. The expected complete-data log-likelihood is

E[l_c(ϑ; x, z) | ϑ^{(h)}] = \sum_{g=1}^{G} \sum_{i=1}^{n} z_{ig}^{(h)} \log\left( π_g f(x_i | θ_g) \right),   (3.3)

where ϑ^{(h)} denotes the estimate of ϑ after the hth iteration and z_{ig}^{(h)} = E[z_{ig} | x_i, ϑ^{(h)}] (Bouveyron and Brunet-Saumard, 2014). During the M-step, the expected log-likelihood from the previous step is maximized and the current estimate of ϑ is updated for the next iteration. A remarkable feature of the EM algorithm is that, at each iteration, the log-likelihood is guaranteed to remain the same or improve, until the desired stopping criterion is reached.

AECM Algorithm

A variation of the EM algorithm is the expectation-conditional maximization (ECM) algorithm, which was first established by Meng and Rubin (1993). Overall, the ECM is similar to the EM algorithm but with one key distinction: the maximization step is replaced by multiple conditional maximization (CM) steps (McNicholas, 2016). A modified version of the ECM algorithm is the alternating expectation-conditional maximization (AECM) algorithm (Meng and Van Dyk, 1997). This version of the algorithm allows the basis for the incomplete data to vary across the conditional maximization steps (Murray et al., 2014). The AECM algorithm is particularly useful for models like the mixture of generalized hyperbolic factor analyzers model, whose sources of incomplete data are both the unknown group labels {z_1, z_2, ..., z_n} and the unobserved latent factors {u_1, u_2, ..., u_n}.

Method Initialization

In order to detect the global optimum, it is necessary to start with a good choice of initial values (ϑ_0) for the model and to run the algorithm several times, in order to compensate for the effects of random initializations (McLachlan et al., 2002). A

commonly used initialization method is the k-means clustering algorithm (Hartigan and Wong, 1979). The objective of the k-means algorithm is to partition a set of observations into k clusters such that the distances between the points within each cluster and the cluster centers are minimized.

Convergence

One widely used stopping criterion for the EM algorithm is a measure based on lack of improvement in the log-likelihood, that is,

l^{(r+1)} − l^{(r)} < ɛ,   (3.4)

where l^{(r)} is the observed log-likelihood value at the rth iteration and ɛ is some arbitrarily small value. The stopping criterion for the AECM algorithm uses Aitken's acceleration (Aitken, 1926), which estimates the asymptotic maximum of the log-likelihood. Aitken's acceleration at iteration r can be written as

a^{(r)} = \frac{l^{(r+1)} − l^{(r)}}{l^{(r)} − l^{(r−1)}},   (3.5)

and the asymptotic estimate of the log-likelihood at iteration r + 1 is

l_∞^{(r+1)} = l^{(r)} + \frac{1}{1 − a^{(r)}} \left( l^{(r+1)} − l^{(r)} \right).   (3.6)

McNicholas et al. (2010) consider the algorithm to have reached its stopping condition when

l_∞^{(r+1)} − l^{(r)} < ɛ.   (3.7)
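A minimal sketch of the stopping rule in (3.5)-(3.7), written as a function of the stored log-likelihood values from successive iterations; the tolerance and the example sequence are illustrative, and an absolute value is used to guard against tiny negative differences from numerical noise.

```r
# Sketch of the Aitken-based stopping rule (3.5)-(3.7); inputs are illustrative.
aitken_converged <- function(loglik, eps = 1e-6) {
  r <- length(loglik)
  if (r < 3) return(FALSE)
  a_r   <- (loglik[r] - loglik[r - 1]) / (loglik[r - 1] - loglik[r - 2])   # (3.5)
  l_inf <- loglik[r - 1] + (loglik[r] - loglik[r - 1]) / (1 - a_r)         # (3.6)
  abs(l_inf - loglik[r - 1]) < eps                                         # (3.7)
}

aitken_converged(c(-1050, -1010, -1001, -1000.2, -1000.05))
```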

3.3 Predicted Classifications

Once the desired stopping criterion is achieved and parameter estimation is complete, the predicted group labels are retrieved from the a posteriori probability that observation x_i belongs to the gth component:

ẑ_{ig} = \frac{\hat{π}_g f_g(x_i | \hat{ϑ}_g)}{\sum_{h=1}^{G} \hat{π}_h f_h(x_i | \hat{ϑ}_h)},   (3.8)

for i = 1, ..., n and g = 1, ..., G. The numerical values for these a posteriori classifications can be designated as soft or hard depending on the nature of the problem. Soft classifications treat the ẑ_{ig} as each observation's probability of belonging to each component, whereas hard classifications restrict ẑ_{ig} ∈ {0, 1} (McNicholas, 2016). In practice, if one wishes to convert the a posteriori classifications from soft to hard, this can be done using maximum a posteriori (MAP) classifications, where

MAP{ẑ_{ig}} = 1 if g = \arg\max_h {ẑ_{ih}}, and 0 otherwise.   (3.9)

3.4 Model Selection

Penalized likelihood approaches such as the Bayesian information criterion (BIC; Schwarz, 1978) are used for determining the number of components in model-based clustering. The BIC is given by

BIC = 2 l(\hat{ϑ}) − ρ \log n,   (3.10)

where ρ is the number of free parameters in the model, n is the number of observations, and l(\hat{ϑ}) is the optimized log-likelihood value. The use of the BIC for selecting the number of components in a mixture model has been supported by Leroux (1992) and Keribin (2000). The BIC can also be used for selecting the number of latent factors in the factor analysis model; simulation studies by Lopes and West (2004) corroborate its use for this purpose.

3.5 Performance Assessment

The adjusted Rand index (ARI), introduced by Hubert and Arabie (1985), is the method typically used for assessing classification performance in model-based clustering. The ARI stems from the Rand index (Rand, 1971), which is the ratio of pairwise agreements to the total number of pairs. The Rand index can take on any value from 0 to 1, where 1 indicates perfect class agreement. The main drawback of the Rand index is that it has a positive expected value under random classification. The ARI was designed to compensate for this and is defined as

ARI = \frac{\text{index} − \text{expected index}}{\text{maximum index} − \text{expected index}}.   (3.11)

Analogous to the Rand index, an ARI value of 1 indicates perfect class agreement; however, the expected value of the ARI under random classification is 0. Negative ARI values are possible if the classifications are worse than would be expected under random classification.
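The quantities in (3.9)-(3.11) are simple to compute; the sketch below assumes the mclust package for the ARI, and the labels, log-likelihood value, and parameter count are illustrative placeholders.

```r
# Sketches of the MAP rule (3.9), the BIC (3.10), and the ARI (3.11);
# all inputs are illustrative placeholders.
bic <- function(loglik, rho, n) 2 * loglik - rho * log(n)   # (3.10); larger is better here
bic(loglik = -1234.5, rho = 20, n = 500)

# Hard (MAP) classifications from a matrix z_hat of a posteriori probabilities (3.9):
# map_labels <- apply(z_hat, 1, which.max)

library(mclust)   # adjustedRandIndex(); mclust is an assumption for this example
true_labels      <- rep(1:2, each = 50)
predicted_labels <- c(rep(1, 45), rep(2, 55))
adjustedRandIndex(true_labels, predicted_labels)            # (3.11)
```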

Chapter 4

Application

In this chapter, model-based clustering experiments were conducted using both real and simulated data. The class labels were treated as missing and all data sets were scaled. The R package MixGHD, created by Tortora et al. (2015), contains the functions MGHD and MGHFA, which were used to implement the MGHD and MGHFA methods, respectively. These clustering methods were initialized using the k-means algorithm, with the best-fitting models chosen via the BIC. The number of principal components retained for each analysis was determined by a threshold of 85 percent of the proportion of variance explained. The classification performance of these approaches was then assessed using the ARI.
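The sketch below outlines the two pipelines compared in this chapter, applied to the wine data of Section 4.2.1. The MGHD and MGHFA argument names, and the layout of the pgmm wine data, are assumptions that should be checked against the installed package versions; the numbers of components and factors simply mirror those reported below, and the cluster-label extraction is left as a placeholder.

```r
# Hedged sketch of the MGHFA and PCA & MGHD pipelines; argument names and the
# wine data layout are assumptions, not confirmed against a specific version.
library(MixGHD)
library(pgmm)     # wine data used in Section 4.2.1
library(mclust)   # adjustedRandIndex()

data(wine)
x      <- scale(wine[, -1])   # first column is assumed to hold the wine type
labels <- wine[, 1]

## MGHFA: clustering and local dimension reduction in one model
fit_mghfa <- MGHFA(data = x, G = 3, q = 1)

## PCA followed by MGHD on the retained components
pca  <- prcomp(x)
prop <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k    <- which(prop >= 0.85)[1]               # retain roughly 85% of the variance
fit_mghd <- MGHD(data = pca$x[, 1:k], G = 3)

## e.g. adjustedRandIndex(labels, <cluster labels extracted from each fit>)
```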

4.1 Simulated Studies

4.1.1 Twonorm Data

The twonorm data set contained 20 numerical attributes over 7400 observations and was simulated using two classes from the multivariate normal distribution with unit covariance (Breiman, 1996). The first class had mean vector (a, a, ..., a) while the second class had mean vector (−a, −a, ..., −a), where a is a positive constant. The MGHFA model was fitted to the data for G = 1, 2, ..., 5 components and q = 1, 2, ..., 5 latent factors. The BIC selected G = 2 components with q = 2 latent factors, and the associated model had an excellent classification performance (Table 4.2; ARI = 0.914). Next, the MGHD model was applied to the first 16 principal components, with the BIC selecting G = 2 components. The aforementioned components explained roughly 85 percent of the total variation in the data. The corresponding model had an excellent classification performance (Table 4.2; ARI = 0.915).

Table 4.1: Cross-tabulation of the true versus predicted class labels for model-based clustering on twonorm data: (a) MGHFA; (b) PCA & MGHD.

Table 4.2: ARI values and misclassification rates (MCR), based on predicted classifications for the unlabelled observations, for the MGHFA and PCA & MGHD models.

Figure 4.2: Projection of the clustered twonorm data onto the first two principal components.

4.1.2 Ringnorm Data

The ringnorm data set contained 20 quantitative features over 7400 instances and was simulated using two classes from the multivariate normal distribution (Breiman, 1996). The first class had a zero mean vector and a covariance matrix with 4s along its main diagonal and 0s on the off-diagonal entries. The second class had mean vector (a, a, ..., a) and a unit covariance matrix. The MGHFA model was fitted to the data, resulting in the BIC selecting G = 2 components and q = 1 latent factor. The corresponding model had an excellent classification performance (Table 4.4; ARI = 0.934). Roughly 87 percent of the total variation in the data is accounted for by the first 17 principal components. As a result, the MGHD model was applied to the first 17 principal components, resulting in a model with excellent classification performance (Table 4.4; ARI = 0.927).

Table 4.3: Cross-tabulation of the true versus predicted class labels for model-based clustering on ringnorm data: (a) MGHFA; (b) PCA & MGHD.

Table 4.4: ARI values and misclassification rates (MCR), based on predicted classifications for the unlabelled observations, for the MGHFA and PCA & MGHD models.

Figure 4.3: Projection of the clustered ringnorm data onto the first two principal components.

4.1.3 Waveform Data

This data set was retrieved from the UCI machine learning repository. The data contained 21 attributes measured over 5000 observations (Breiman, 1996). Each observation was generated with added noise (mean 0 and variance 1), and the observations can be split into three equal partitions according to wave type. The MGHFA model was fitted to this data, resulting in the BIC selecting G = 3 components and q = 1 latent factor. The associated model had a poor classification performance (Table 4.6; ARI = 0.531). Roughly 86 percent of the total variation in the data was explained by the first 12 principal components. Consequently, the MGHD model was fitted to the first 12 principal components, with the BIC selecting G = 3 components. The resulting model had a better classification performance (Table 4.6; ARI = 0.601) than the approach using MGHFA.

Table 4.5: Cross-tabulation of the true versus predicted class labels for model-based clustering on waveform data: (a) MGHFA; (b) PCA & MGHD.

Table 4.6: ARI values and misclassification rates (MCR), based on predicted classifications for the unlabelled observations, for the MGHFA and PCA & MGHD models.

Figure 4.4: Projection of the clustered waveform data onto the first two principal components.

4.2 Real Data

4.2.1 Wine Data

This data set is available in the pgmm library (McNicholas et al., 2011) for R. The data contain 27 chemical and physical measurements on three types of wine, namely Barolo, Grignolino, and Barbera. There were 178 observations of these three types of wine, all of which were cultivated in the same region of Italy (Forina et al., 1986). Fitting the MGHFA model to this data resulted in the BIC selecting G = 3 components and q = 1 latent factor. The corresponding model had a fairly good classification performance (Table 4.8; ARI = 0.714). Implementing PCA revealed that the first 12 principal components accounted for roughly 85 percent of the total variation in the wine data. Based on this, the MGHD model was initiated using the first 12 principal

components, and this resulted in a model with a very poor classification performance (Table 4.8; ARI = 0.365) compared to the MGHFA approach.

Table 4.7: Cross-tabulation of the true versus predicted class labels for model-based clustering on wine data: (a) MGHFA; (b) PCA & MGHD.

Table 4.8: ARI values and misclassification rates (MCR), based on predicted classifications for the unlabelled observations, for the MGHFA and PCA & MGHD models.

4.2.2 Satellite Data

The satellite data were collected from the UCI machine learning repository. The data contained 36 quantitative attributes measured over 4435 observations. Observations were collected with the goal of classifying 3 × 3 pixel neighbourhoods based on a central pixel (Srinivasan, 1993). This data set contained six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil. Applying the MGHFA model to this data led to the BIC selecting G = 6 components and q = 1 latent factor. The corresponding model gave a poor classification performance (Table 4.11; ARI = 0.397). Using PCA, we found that about 89 percent of the total variation in the data is explained by the first three

principal components. As a result, the MGHD model was applied using the first three principal components, with the BIC selecting G = 6. The associated model had a better classification performance (Table 4.11; ARI = 0.496) than the MGHFA approach.

Table 4.9: Cross-tabulation of the true versus predicted class labels for model-based clustering on Landsat Satellite data using MGHFA.

Table 4.10: Cross-tabulation of the true versus predicted class labels for model-based clustering on Landsat Satellite data using PCA & MGHD.

Table 4.11: ARI values and misclassification rates (MCR), based on predicted classifications for the unlabelled observations, for the MGHFA and PCA & MGHD models.

Chapter 5

Conclusions

In this thesis, the clustering of high-dimensional data using the MGHFA approach was compared against the procedure utilizing PCA for feature extraction followed by clustering using the MGHD. This analysis was conducted by applying the aforementioned methods to both simulated and real data. For the two data sets simulated from the multivariate normal distribution, the PCA & MGHD approach demonstrated ARIs comparable to those of the MGHFA approach; executing either approach on these data resulted in excellent classification performance, with over 95 percent class agreement. For the waveform data set, both of the associated models had relatively poor classification performance, with the PCA & MGHD approach displaying a slightly higher ARI. For the wine data set, using the MGHFA model resulted in a good classification performance, while the quality of the partition obtained by the PCA & MGHD model was disappointing. For the satellite data set, the PCA & MGHD model performed slightly better than the MGHFA model, but the respective classification performances of both approaches were quite poor.

Overall, the PCA & MGHD method performed as well as or better than the MGHFA method for most of the data sets. However, caution must be exercised when utilizing PCA for dimension reduction in conjunction with clustering tasks. This is because, as noted by Bouveyron and Brunet-Saumard (2014), PCA was not designed to take the clustering structure into account, and this may lead to sub-optimal partitioning of the data. Possible future research directions could consist of using the MGHD together with other dimension reduction techniques, such as variable selection or subspace methods, which can simultaneously find partitions and reduce the dimensionality of the data.

Appendix A

Bessel Functions

In general, Bessel functions are the canonical solutions y(x) of the Bessel differential equation

x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 − λ^2) y = 0,

for an arbitrary complex number λ. The solutions of the Bessel equation are called modified Bessel functions of the first and second kind when the domain of x includes the complex numbers (Abramowitz and Stegun, 1968). The modified Bessel function of the second kind, also known as the modified Bessel function of the third kind, can be expressed as

K_λ(x) = \frac{π}{2} \frac{I_{−λ}(x) − I_λ(x)}{\sin(λπ)},

where I_λ(x) is the modified Bessel function of the first kind, obtained from the Bessel differential equation using the power series approach:

I_λ(x) = \sum_{m=0}^{∞} \frac{1}{m!\, Γ(m + λ + 1)} \left( \frac{x}{2} \right)^{2m+λ}.

K_λ(x) is an exponentially decaying function with a singularity at x = 0 (Abramowitz and Stegun, 1968), and it is an important component of both the generalized inverse Gaussian and the generalized hyperbolic distributions.
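A quick numerical check of the relation between K_λ and I_λ given above can be done with base R's besselI() and besselK(); the values of λ and x are illustrative, and λ must be non-integer for the sine formula to apply directly.

```r
# Numerical check of K_lambda(x) = (pi/2) * (I_{-lambda}(x) - I_lambda(x)) / sin(lambda*pi)
# using base R's Bessel functions; lambda and x values are illustrative.
lambda <- 0.7
x      <- seq(0.5, 5, by = 0.5)

lhs <- besselK(x, nu = lambda)
rhs <- (pi / 2) * (besselI(x, nu = -lambda) - besselI(x, nu = lambda)) / sin(lambda * pi)

max(abs(lhs - rhs))   # should be numerically negligible
```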

Bibliography

Abramowitz, M. and Stegun, I. A. (1968). Handbook of Mathematical Tables. New York: Dover Publications.
Aitken, A. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh, 45(1).
Andrews, J. L. and McNicholas, P. D. (2012). Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing, 22(5).
Barndorff-Nielsen, O. (1977). Exponentially decreasing distributions for the logarithm of particle size. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, volume 353.
Barndorff-Nielsen, O. and Halgreen, C. (1977). Infinite divisibility of the hyperbolic and generalized inverse Gaussian distributions. Probability Theory and Related Fields, 38(4).
Bartlett, M. S. (1953). Factor analysis in psychology as a statistician sees it. In Uppsala Symposium on Psychological Factor Analysis, number 3. Copenhagen: Ejnar Mundsgaards.

Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis, 71.
Breiman, L. (1996). Bias, variance, and arcing classifiers (Technical Report 460). Statistics Department, University of California.
Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2).
Daas, P. J., Puts, M. J., Buelens, B., and van den Hurk, P. A. (2015). Big data as a source for official statistics. Journal of Official Statistics, 31(2).
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1).
Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458).
Ghahramani, Z. and Hinton, G. E. (1997). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4).
Halphen, E. (1941). Sur un nouveau type de courbe de fréquence. Comptes Rendus de l'Académie des Sciences, 213.

Hammerstein, E. A. v. (2010). Generalized Hyperbolic Distributions: Theory and Applications to CDO Pricing. Ph.D. thesis, Universität Freiburg.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1).
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417.
Hu, W. (2005). Calibration of Multivariate Generalized Hyperbolic Distributions Using the EM Algorithm, with Applications in Risk Management, Portfolio Optimization and Portfolio Credit Risk. Ph.D. thesis, The Florida State University, Tallahassee.
Huber, P. J. (1985). Projection pursuit. The Annals of Statistics, 13(2).
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1).
Johnson, R. and Wichern, D. (2007). Applied Multivariate Statistical Analysis. Prentice-Hall, New Jersey, sixth edition.
Jorgensen, B. (2012). Statistical Properties of the Generalized Inverse Gaussian Distribution, volume 9. Springer Science & Business Media.
Juárez, M. A. and Steel, M. F. (2010). Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics, 28(1).

Karlis, D. and Santourian, A. (2009). Model-based clustering with non-elliptically contoured distributions. Statistics and Computing, 19(1).
Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhyā: The Indian Journal of Statistics, Series A, 62(1).
Koudou, A. E., Ley, C., et al. (2014). Characterizations of GIG laws: A survey. Probability Surveys, 11.
Lawley, D. N. and Maxwell, A. E. (1962). Factor analysis as a statistical method. Journal of the Royal Statistical Society: Series D (The Statistician), 12(3).
Leroux, B. G. (1992). Consistent estimation of a mixing distribution. The Annals of Statistics, 20(3).
Lopes, H. F. and West, M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica, 14(1).
Mangasarian, O. L., Street, W. N., and Wolberg, W. H. (1995). Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4).
McLachlan, G. and Peel, D. (2000). Mixtures of factor analyzers. In Proceedings of the Seventh International Conference on Machine Learning.
McLachlan, G. J., Bean, R., and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18(3).
McNeil, A. J., Frey, R., and Embrechts, P. (2015). Quantitative Risk Management: Concepts, Techniques and Tools. Princeton University Press.

McNicholas, P., Jampani, K., McDaid, A., Murphy, T., and Banks, L. (2011). pgmm: Parsimonious Gaussian Mixture Models. R package version 1.1.
McNicholas, P. D. (2016). Mixture Model-Based Classification. CRC Press.
McNicholas, P. D., Murphy, T. B., McDaid, A. F., and Frost, D. (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics & Data Analysis, 54(3).
McNicholas, S. M., McNicholas, P. D., and Browne, R. P. (2017). A mixture of variance-gamma factor analyzers. In S. E. Ahmed, editor, Big and Complex Data Analysis. Springer.
Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2).
Meng, X.-L. and Van Dyk, D. (1997). The EM algorithm: an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(3).
Murray, P. M., Browne, R. P., and McNicholas, P. D. (2014). Mixtures of skew-t factor analyzers. Computational Statistics & Data Analysis, 77.
Pearson, K. (1901). Principal components analysis. The London, Edinburgh and Dublin Philosophical Magazine and Journal, 6(2), 566.
R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL: R-project.org.

Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336).
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2).
Scott, D. W. and Thompson, J. R. (1983). Probability density estimation in higher dimensions. In Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface, volume 528. North-Holland, Amsterdam.
Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10(3).
Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1).
Tiedeman, D. (1955). On the study of types. In S. B. Sells, editor, Symposium on Pattern Analysis, pages 1-14. Randolph Field, TX: U.S.A.F. School of Aviation Medicine, Air University.
Tortora, C., McNicholas, P. D., and Browne, R. P. (2015). A mixture of generalized hyperbolic factor analyzers. Advances in Data Analysis and Classification, 10(4).
Wolfe, J. H. (1963). Object Cluster Analysis of Social Areas. Master's thesis, University of California.


A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models

U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models Jaemo Sung 1, Sung-Yang Bang 1, Seungjin Choi 1, and Zoubin Ghahramani 2 1 Department of Computer Science, POSTECH,

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception

LINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,

More information

Model Selection for Mixtures of Factor Analyzers via Hierarchical BIC

Model Selection for Mixtures of Factor Analyzers via Hierarchical BIC Stat Comput manuscript No. (will be inserted by the editor) Model Selection for Mixtures of Factor Analyzers via Hierarchical BIC Jianhua Zhao Philip L.H. Yu Lei Shi Received: date / Accepted: date Abstract

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Probabilistic Fisher Discriminant Analysis

Probabilistic Fisher Discriminant Analysis Probabilistic Fisher Discriminant Analysis Charles Bouveyron 1 and Camille Brunet 2 1- University Paris 1 Panthéon-Sorbonne Laboratoire SAMM, EA 4543 90 rue de Tolbiac 75013 PARIS - FRANCE 2- University

More information

The Expectation Maximization Algorithm

The Expectation Maximization Algorithm The Expectation Maximization Algorithm Frank Dellaert College of Computing, Georgia Institute of Technology Technical Report number GIT-GVU-- February Abstract This note represents my attempt at explaining

More information

Mixtures and Hidden Markov Models for analyzing genomic data

Mixtures and Hidden Markov Models for analyzing genomic data Mixtures and Hidden Markov Models for analyzing genomic data Marie-Laure Martin-Magniette UMR AgroParisTech/INRA Mathématique et Informatique Appliquées, Paris UMR INRA/UEVE ERL CNRS Unité de Recherche

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering

Overview. and data transformations of gene expression data. Toy 2-d Clustering Example. K-Means. Motivation. Model-based clustering Model-based clustering and data transformations of gene expression data Walter L. Ruzzo University of Washington UW CSE Computational Biology Group 2 Toy 2-d Clustering Example K-Means? 3 4 Hierarchical

More information

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides

Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Lecture 16: Small Sample Size Problems (Covariance Estimation) Many thanks to Carlos Thomaz who authored the original version of these slides Intelligent Data Analysis and Probabilistic Inference Lecture

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition

Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Applied Multivariate Statistical Analysis Richard Johnson Dean Wichern Sixth Edition Pearson Education Limited Edinburgh Gate Harlow Essex CM20 2JE England and Associated Companies throughout the world

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

New Global Optimization Algorithms for Model-Based Clustering

New Global Optimization Algorithms for Model-Based Clustering New Global Optimization Algorithms for Model-Based Clustering Jeffrey W. Heath Department of Mathematics University of Maryland, College Park, MD 7, jheath@math.umd.edu Michael C. Fu Robert H. Smith School

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

An Introduction to Multivariate Statistical Analysis

An Introduction to Multivariate Statistical Analysis An Introduction to Multivariate Statistical Analysis Third Edition T. W. ANDERSON Stanford University Department of Statistics Stanford, CA WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Gaussian Mixture Models, Expectation Maximization

Gaussian Mixture Models, Expectation Maximization Gaussian Mixture Models, Expectation Maximization Instructor: Jessica Wu Harvey Mudd College The instructor gratefully acknowledges Andrew Ng (Stanford), Andrew Moore (CMU), Eric Eaton (UPenn), David Kauchak

More information

Lecture 14. Clustering, K-means, and EM

Lecture 14. Clustering, K-means, and EM Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.

More information

Package EMMIXmfa. October 18, Index 15

Package EMMIXmfa. October 18, Index 15 Type Package Package EMMIXmfa October 18, 2017 Title Mixture models with component-wise factor analyzers Version 1.3.9 Date 2017-10-18 Author Suren Rathnayake, Geoff McLachlan, Jungsun Baek Maintainer

More information

Expectation Maximization Algorithm

Expectation Maximization Algorithm Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters

More information

Unsupervised Classification via Convex Absolute Value Inequalities

Unsupervised Classification via Convex Absolute Value Inequalities Unsupervised Classification via Convex Absolute Value Inequalities Olvi L. Mangasarian Abstract We consider the problem of classifying completely unlabeled data by using convex inequalities that contain

More information

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis

Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture

More information

A Mixture of Coalesced Generalized Hyperbolic Distributions

A Mixture of Coalesced Generalized Hyperbolic Distributions A Mixture of Coalesced Generalized Hyperbolic Distributions arxiv:1403.2332v7 [stat.me] 6 Nov 2017 Cristina Tortora, Brian C. Franczak, Ryan P. Browne and Paul D. McNicholas Department of Mathematics and

More information

Clustering VS Classification

Clustering VS Classification MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:

More information

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization

Computer Vision Group Prof. Daniel Cremers. 6. Mixture Models and Expectation-Maximization Prof. Daniel Cremers 6. Mixture Models and Expectation-Maximization Motivation Often the introduction of latent (unobserved) random variables into a model can help to express complex (marginal) distributions

More information

Label Switching and Its Simple Solutions for Frequentist Mixture Models

Label Switching and Its Simple Solutions for Frequentist Mixture Models Label Switching and Its Simple Solutions for Frequentist Mixture Models Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Abstract The label switching

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

Dimensional reduction of clustered data sets

Dimensional reduction of clustered data sets Dimensional reduction of clustered data sets Guido Sanguinetti 5th February 2007 Abstract We present a novel probabilistic latent variable model to perform linear dimensional reduction on data sets which

More information

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades

More information

Chapter 10. Semi-Supervised Learning

Chapter 10. Semi-Supervised Learning Chapter 10. Semi-Supervised Learning Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Outline

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

High Dimensional Discriminant Analysis

High Dimensional Discriminant Analysis High Dimensional Discriminant Analysis Charles Bouveyron LMC-IMAG & INRIA Rhône-Alpes Joint work with S. Girard and C. Schmid ASMDA Brest May 2005 Introduction Modern data are high dimensional: Imagery:

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

Curves clustering with approximation of the density of functional random variables

Curves clustering with approximation of the density of functional random variables Curves clustering with approximation of the density of functional random variables Julien Jacques and Cristian Preda Laboratoire Paul Painlevé, UMR CNRS 8524, University Lille I, Lille, France INRIA Lille-Nord

More information

Different points of view for selecting a latent structure model

Different points of view for selecting a latent structure model Different points of view for selecting a latent structure model Gilles Celeux Inria Saclay-Île-de-France, Université Paris-Sud Latent structure models: two different point of views Density estimation LSM

More information

Linear Models for Classification

Linear Models for Classification Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

Statistical Learning. Dong Liu. Dept. EEIS, USTC

Statistical Learning. Dong Liu. Dept. EEIS, USTC Statistical Learning Dong Liu Dept. EEIS, USTC Chapter 6. Unsupervised and Semi-Supervised Learning 1. Unsupervised learning 2. k-means 3. Gaussian mixture model 4. Other approaches to clustering 5. Principle

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 4: Factor analysis Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2017/2018 Master in Mathematical Engineering Pedro

More information

ABSTRACT INTRODUCTION

ABSTRACT INTRODUCTION ABSTRACT Presented in this paper is an approach to fault diagnosis based on a unifying review of linear Gaussian models. The unifying review draws together different algorithms such as PCA, factor analysis,

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Module Master Recherche Apprentissage et Fouille

Module Master Recherche Apprentissage et Fouille Module Master Recherche Apprentissage et Fouille Michele Sebag Balazs Kegl Antoine Cornuéjols http://tao.lri.fr 19 novembre 2008 Unsupervised Learning Clustering Data Streaming Application: Clustering

More information