CMPS 242: Project Report


RadhaKrishna Vuppala
Univ. of California, Santa Cruz

Abstract

Classification procedures impose certain models on the data, and they perform well when those assumptions match the data. The assumption may be, for example, that the data is linearly separable or that it comes from some parametric family of distributions. Parametric models can be made flexible enough to handle various types of data by using mixture models instead of assuming a single distribution. However, even parametric mixture models assume that the data contains a finite number of components. Bayesian nonparametric mixtures based on Dirichlet process mixtures (DPM) are a way of modeling infinite mixtures, and are more flexible in modeling both the form and the number of distributions that the data could be generated from. This paper explores the strengths and weaknesses of these models, and also explores the variational approximation for the DPM model. Variational approximations are deterministic approximation techniques that converge quickly and can replace costly MCMC computations in situations where an exact answer is not required.

1. Introduction

Bayesian nonparametrics received a strong push when the Dirichlet process was presented by Ferguson [4]. The Dirichlet process is a measure on measures, i.e. a sample drawn from the Dirichlet process is itself a distribution. Describing a process with a large support and an analytically tractable posterior was a great step forward for Bayesian analysis.

$$G \mid \{G_0, \alpha\} \sim \mathrm{DP}(\alpha, G_0)$$
$$\eta_n \mid G \sim G, \quad n = 1, \dots, N$$

If $G$ is a random probability measure drawn from a Dirichlet process, and $\eta_n$ are samples drawn from $G$, then it can be shown by integrating out $G$ that the joint distribution of $\{\eta_1, \dots, \eta_n\}$ follows a Pólya urn scheme, as demonstrated by Blackwell and MacQueen. The probability that a new sample $\eta_n$ is equal to one of the existing samples is given as follows.
$$P(\eta_n = \eta_j) = \begin{cases} \dfrac{N_j}{N + \alpha - 1}, & j = 1, \dots, n-1 \\[2ex] \dfrac{\alpha}{N + \alpha - 1}, & j = n \end{cases}$$

where $N_j = |\{k : \eta_k = \eta_j\}|$ is the number of samples with value $\eta_j$, and the case $j = n$ corresponds to a value not seen before. This indicates that the samples show a clustering behavior. A new sample takes a value equal to one of the existing samples (say $\eta_j$) with a probability proportional to the number of observations in that cluster (i.e. $N_j$). There is also a nonzero probability that the sample starts a new cluster. The parameter $\alpha$ thus plays a critical role in determining how probable a new cluster is.
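The clustering behavior of the urn scheme can be illustrated with a short simulation. The following is a minimal sketch in Python, not part of the original project code; the function name `polya_urn_cluster_sizes` and the seed handling are assumptions made for illustration. It tracks only the cluster sizes, since the cluster values themselves would be independent draws from $G_0$.

```python
import random


def polya_urn_cluster_sizes(n_samples, alpha, seed=0):
    """Simulate cluster sizes of n_samples draws from a Polya urn scheme.

    With n samples already drawn, the next sample joins existing cluster j
    with probability N_j / (n + alpha) and opens a new cluster with
    probability alpha / (n + alpha).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    sizes = []  # sizes[j] holds N_j, the number of samples in cluster j
    for n in range(n_samples):
        u = rng.random() * (n + alpha)
        if u < alpha:
            sizes.append(1)  # new cluster
            continue
        u -= alpha
        for j, size in enumerate(sizes):
            if u < size:
                sizes[j] += 1  # join cluster j, chosen proportionally to N_j
                break
            u -= size
        else:
            sizes[-1] += 1  # guard against floating point rounding
    return sizes


if __name__ == "__main__":
    for alpha in (0.5, 5.0):
        sizes = polya_urn_cluster_sizes(500, alpha)
        print(alpha, len(sizes), sorted(sizes, reverse=True)[:5])
```

Running it with a fixed number of samples and a larger $\alpha$ tends to produce more clusters, in line with the role of $\alpha$ described above.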

A powerful constructive characterization of the Dirichlet process was given by Sethuraman [12] and is known as the stick-breaking construction. Considering two infinite collections of independent random variables $V_i \sim \mathrm{Beta}(1, \alpha)$ and $\eta_i \sim G_0$, for $i = 1, 2, \dots$, the stick-breaking representation for $G$, a sample from the DP, is given by

$$\pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j) \tag{1}$$

$$G = \sum_{i=1}^{\infty} \pi_i(v)\, \delta_{\eta_i} \tag{2}$$

Thus a draw from the Dirichlet process is shown to be a discrete distribution with infinite support, with each atom being drawn from $G_0$. Since the DP has infinite support, it can be used as a prior for nonparametric data models. The Dirichlet process mixture model (DPM) [1] is a nonparametric data model which uses the DP in a hierarchical specification as follows.

$$y_i \mid \theta_i \sim F(\theta_i)$$
$$\theta_i \mid G \sim G$$
$$G \sim \mathrm{DP}(\alpha, G_0)$$

The data $y_1, y_2, \dots, y_n$ is assumed to be i.i.d. data obtained from some distribution $F(\cdot)$. The parameters of the distribution are samples from the DP. Since a draw from the DP is a discrete mixture of infinitely many atoms, the data $y$ also forms an infinite mixture model. The data distribution also forms a Pólya urn and hence shows a clustering behavior. The data from any single cluster is modeled with a simple distribution, the normal distribution being usually preferred, and the mixture proportions $\pi_i(v)$ indicate how likely it is that the data came from a particular cluster. This model is more flexible than a single parametric distribution in that it can model data from multiple distributions. Another interesting observation is that the number of clusters is not a parameter in this model. The inference algorithm is able to deduce the number of components. Traditional clustering algorithms (such as K-means) need to be provided with the number of components in the mixture.

As is common in Bayesian methods, the posteriors resulting from these models are intractable and no closed form solutions can be found. While this troubled the Bayesian community for a while, the introduction of Markov chain Monte Carlo sampling methods [9, 6, 13, 5] has provided a necessary breakthrough for Bayesian learning.
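The stick-breaking representation in equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch only, not the report's implementation. The infinite sum is truncated at a finite level `truncation` by setting the last stick fraction to one, which is an assumption introduced here so that the weights sum to one; the names `stick_breaking_weights`, `sample_truncated_dp`, `draw_from_g` and the `base_sampler` callback standing in for $G_0$ are hypothetical.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Return truncated stick-breaking weights pi_1, ..., pi_T."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # V_i ~ Beta(1, alpha)
        weights.append(v * remaining)    # pi_i = v_i * prod_{j<i} (1 - v_j)
        remaining *= 1.0 - v
    # Assumption: V_T = 1, so the last piece takes whatever is left.
    weights.append(remaining)
    return weights


def sample_truncated_dp(alpha, truncation, base_sampler, seed=0):
    """Draw a truncated G = sum_i pi_i * delta_{eta_i}, with eta_i ~ G_0."""
    rng = random.Random(seed)
    weights = stick_breaking_weights(alpha, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]
    return weights, atoms


def draw_from_g(weights, atoms, n_draws, seed=1):
    """Sample from the discrete distribution G; repeated values form clusters."""
    rng = random.Random(seed)
    return rng.choices(atoms, weights=weights, k=n_draws)


if __name__ == "__main__":
    # Standard normal base measure, chosen only for the example.
    w, a = sample_truncated_dp(2.0, 25, lambda rng: rng.gauss(0.0, 1.0))
    draws = draw_from_g(w, a, 100)
    print(len(set(draws)), "distinct values among", len(draws), "draws")
```

Because $G$ is discrete, repeated draws from it coincide with positive probability, which is the clustering behavior that the DPM model exploits.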
Markov chain Monte Carlo (MCMC) methods, which are computationally expensive and based on probabilistic sampling, are typically used to estimate the intractable posterior distributions. Gibbs sampling, a special case of the Metropolis-Hastings algorithm that is applicable when the full conditional distributions can be sampled directly (as is the case when conjugate priors are used), is a very popular method for estimation. Gibbs sampling algorithms for the DPM model were developed by Neal [10].

An interesting extension of this model is the Nested Dirichlet Process model (NDP) [11], which models hierarchical relationships in the data. Here the data in a class is assumed to come from a mixture of distributions, so the model is more flexible and can capture more complex relationships in the data.

$$y_i \sim \sum_{l=1}^{\infty} w_{il}\, h(\cdot \mid \phi_{il})$$
$$(\{w_{il}\}_{l=1}^{\infty}, \{\phi_{il}\}_{l=1}^{\infty}) \sim \bar{G}$$
$$\bar{G} \sim \mathrm{DP}(\alpha \bar{G}_0)$$

The rest of the report is organized as follows. Section two discusses an algorithm for estimating the Nested Dirichlet Process model. As part of the project, the variational approximation method was studied in some detail. The variational approximation technique is discussed in section three and applied to the specific case of the DPM model. Section four experimentally compares the MCMC implementation of the NDP model with other well-known classifiers. The variational approximation for the DPM model is compared with the Gibbs implementation. Section five describes

some of the key learnings from the work and also describes some future work.

2. Gibbs sampling algorithms

This section describes a Pólya urn Gibbs sampler for the NDP model. In this case we introduce two sets of labels: the class labels $\zeta_1, \dots, \zeta_n$ indicating to which class each observation belongs, and the component labels $\xi_1, \dots, \xi_n$ indicating to which component the observation is assigned, i.e., $\zeta_i = k$ and $\xi_i = l$ if and only if $\theta_i = \phi_{lk}$ (and therefore $G_i = G_k$). Without loss of generality, we assume that classes are labeled consecutively between 1 and $K$, and that the mixture components corresponding to the $k$-th distribution are also labeled consecutively between 1 and $L_k$. We use negative superscripts throughout to denote quantities computed after removing the relevant observation. Note that the process generating the sequence of $G_i$s is just a Dirichlet process (in a general Polish space), and therefore we can express the full conditional distribution $\zeta_i \mid \zeta^{-i}$ as a Pólya urn. Similarly, all observations assigned to the same class arise from another Dirichlet process. Combining these two observations, we get the joint full conditional distribution

$$\zeta_i, \xi_i \mid \zeta^{-i}, \xi^{-i} \sim \sum_{k=1}^{K^{-i}} \sum_{l=1}^{L_k^{-i}} \frac{n_{\cdot k}^{-i}}{\alpha + n - 1}\, \frac{n_{lk}^{-i}}{\beta_k + n_{\cdot k}^{-i}}\, \delta_{(k,l)} + \sum_{k=1}^{K^{-i}} \frac{n_{\cdot k}^{-i}}{\alpha + n - 1}\, \frac{\beta_k}{\beta_k + n_{\cdot k}^{-i}}\, \delta_{(k, L_k^{-i}+1)} + \frac{\alpha}{\alpha + n - 1}\, \delta_{(K^{-i}+1,\, 1)} \tag{3}$$

where $n_{lk}^{-i}$ is the number of observations in class $k$ assigned to component $l$ and $n_{\cdot k}^{-i} = \sum_{l=1}^{L_k} n_{lk}^{-i}$ is the total number of observations in class $k$. Note that the first term in (3) provides the prior full conditional probability that observation $i$ is assigned to one of the existing components in class $k$, the second term corresponds to the prior probability that a new component is created in class $k$ to accommodate observation $i$, and the third term is the prior probability that it is assigned to a new class. This result can be used to construct a Gibbs sampler to estimate the model.
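As a sanity check on the three terms of (3), the prior probabilities can be computed directly from the label counts. The sketch below is illustrative and is not the report's code; the function name `nested_urn_prior` and the list-of-lists layout for the counts are assumptions. It uses the fact that the counts with observation $i$ removed add up to $n - 1$.

```python
def nested_urn_prior(counts, alpha, beta):
    """Prior probabilities of the nested Polya urn in equation (3).

    counts[k][l] is the number of observations in class k+1 assigned to
    component l+1, with the current observation removed; beta[k] is the
    concentration parameter of class k+1.
    Returns a dict mapping 1-based (class, component) labels to probabilities.
    """
    class_totals = [sum(row) for row in counts]
    denom = alpha + sum(class_totals)  # alpha + n - 1
    probs = {}
    for k, row in enumerate(counts, start=1):
        n_k = class_totals[k - 1]
        class_weight = n_k / denom
        within = beta[k - 1] + n_k
        for l, n_lk in enumerate(row, start=1):
            # First term: existing component l in existing class k.
            probs[(k, l)] = class_weight * n_lk / within
        # Second term: new component in existing class k.
        probs[(k, len(row) + 1)] = class_weight * beta[k - 1] / within
    # Third term: new class, which starts with its first component.
    probs[(len(counts) + 1, 1)] = alpha / denom
    return probs


if __name__ == "__main__":
    prior = nested_urn_prior([[3, 1], [2]], alpha=1.0, beta=[0.5, 0.5])
    for label, p in sorted(prior.items()):
        print(label, round(p, 4))
```

The probabilities returned sum to one, which confirms that the three terms together define a proper distribution over the candidate labels. In the sampler each of them is further multiplied by a predictive likelihood of the observation.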
The data is assumed to be normal and a Normal-Inverse-Wishart prior is imposed on the parameters.

$$y_i \mid \zeta_i, \xi_i \sim N(\mu_i, \Sigma_i)$$
$$\mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / \kappa_0) \tag{4}$$
$$\Sigma_i \sim \text{Inv-Wishart}_{\nu_0}(\Lambda_0^{-1})$$

The probability that an observation is assigned to a new class, or to a new component in an existing class, can be obtained from the prior predictive probability with $G_0$ as the prior and $y_i$ as the data.

$$P_{prior} = \int N(y_i \mid \phi)\, dG_0(\phi) = \pi^{-p/2} \left(\frac{\kappa_0}{1 + \kappa_0}\right)^{p/2} \prod_{j=1}^{p} \frac{\Gamma((\nu_n + 1 - j)/2)}{\Gamma((\nu_0 + 1 - j)/2)} \cdot \frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}} \tag{5}$$

In the above equations, $p$ is the dimensionality of the data, $\phi$ denotes the parameters drawn from the prior (i.e. $\phi = (\mu, \Sigma)$), and $\nu_n$ and $\Lambda_n$ are the parameters of the posterior of $\phi$ based on the prior $G_0$ and the current observation. The probability of the observation being assigned a new class label is given by

$$P(\zeta_i = K + 1) = \frac{\alpha}{n - 1 + \alpha}\, P_{prior} \tag{6}$$

The probability of the observation being assigned a new component label within an existing class $k$ is given by

$$P(\zeta_i = k, \xi_i = L_k + 1) = \frac{n_{\cdot k}^{-i}}{n + \alpha - 1}\, \frac{\beta_k}{n_{\cdot k}^{-i} + \beta_k}\, P_{prior} \tag{7}$$

The probability that the observation belongs to an existing non-empty component $l$ in class $k$ is given by the posterior predictive of the observations in the component. The post predictive probability

($P_{post}$) is obtained as above, but by replacing the prior parameters with those of the posterior (which results from the prior $G_0$ and the normal likelihood of the observations in the component).

$$P(\zeta_i = k, \xi_i = l) = \frac{n_{\cdot k}^{-i}}{n + \alpha - 1}\, \frac{n_{lk}^{-i}}{n_{\cdot k}^{-i} + \beta_k}\, P_{post} \tag{8}$$

The class and component labels for an observation are randomly sampled from the discrete distribution defined by the probabilities computed as above. After the class/component labels are sampled, $\alpha$ and the $\beta_k$ are sampled as described in [3]. The model can be made more flexible by sampling the hyperparameters $\mu_0$, $\kappa_0$ and $\Lambda_0$. (This will be undertaken as future work.)

Algorithm 1 Gibbs sampling for Nested DP
  Initialization: assign random labels to the test data
  for iteration = 1 to niter do
    {draw ζ, ξ labels for each observation}
    for each observation in test data do
      draw ζ_i, ξ_i as per (6), (7), (8)
    end for
    {draw component labels for all observations}
    for each observation in all data do
      draw ξ_i labels
    end for
    {sample the model parameters}
    sample α
    sample β
  end for
  derive a point estimate for the ζ labels

3. Variational Approximation for DPM

MCMC methods, including Gibbs sampling, are computationally expensive, and more so for large datasets. Probabilistic sampling methods are thus not well suited for many classification tasks that require fast turnaround times. The variational approach provides a deterministic alternative for dealing with intractable posteriors. Recently there has been significant interest in applying the variational approach to Bayesian problems [8, 7, 2]. Variational approximation has been applied to various models, including the DPM [2].

The variational approach has its origins in the calculus of variations. A problem is expressed as an optimization problem in which the quantity being optimized is a functional. The solution is obtained by exploring all possible input functions. When the whole solution space is explored, the solution obtained is the exact solution.
But in many situations exploring the full solution space may not be practical. Instead of searching the full space, the space is restricted in a manner that makes computation tractable, and this leads to an approximate solution. In the Bayesian inference problem the typical restriction takes the form of an assumption that the solution is factorizable, i.e. we assume that the posterior distribution (which is the quantity of interest) factorizes into independent distributions.

Consider a full Bayesian model in which all parameters have priors assigned. Let the model parameters as well as the hidden variables of the model be denoted by $Z$ and the observed variables by $X$. As in any hidden variable problem (for example EM), the joint distribution of $(X, Z)$ is assumed and the goal is to find an approximation for $P(Z \mid X)$. The log marginal probability can be decomposed as follows.

$$\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)$$

where the quantities introduced are defined as

$$\mathcal{L}(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ$$

$$\mathrm{KL}(q \,\|\, p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ$$

The log marginal distribution for $X$ is thus expressed in terms of a variable quantity, $q(Z)$. The equation holds for any $q(Z)$. When the KL divergence between $q(Z)$ and $p(Z \mid X)$ is zero, the quantity $\mathcal{L}(q)$ equals the log marginal. This occurs when $q(Z)$ equals the posterior distribution $p(Z \mid X)$. For any other choice, $\mathcal{L}(q)$ acts as a lower bound for the log marginal, because the KL divergence is non-negative. If it is possible to explore all possible functions $q(Z)$, the exact solution can be found. When this is not possible, an approximation can be sought by finding the optimal $q(Z)$ within a restricted family. Assuming that $Z$ can be partitioned into $Z_i$, $i = 1, \dots, M$, the variable function $q(Z)$ is assumed to be of the form

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$$

Substituting this approximation for $q(Z)$, the form of the solution can be found as follows.

$$\begin{aligned}
\mathcal{L}(q) &= \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ \\
&= \int q_j \left\{ \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i \right\} dZ_j - \int q_j \ln q_j \, dZ_j + \text{const} \\
&= \int q_j \ln \tilde{p}(X, Z_j)\, dZ_j - \int q_j \ln q_j \, dZ_j + \text{const}
\end{aligned}$$

where $q_j$ is used as a short form for $q_j(Z_j)$ and $\ln \tilde{p}(X, Z_j) = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i$, up to an additive constant. It can be seen that the last expression above is a negative KL divergence between $q_j(Z_j)$ and $\tilde{p}(X, Z_j)$. $\mathcal{L}(q)$ is maximized when this KL divergence is minimized, and this occurs when $q_j(Z_j) = \tilde{p}(X, Z_j)$. The general form for an optimum solution for $q_j(Z_j)$ is given by

$$\ln q_j^{*}(Z_j) = \int \ln p(X, Z) \prod_{i \neq j} q_i \, dZ_i + \text{const} = E_{i \neq j}[\ln p(X, Z)] + \text{const}$$

Since the optimal solution for $q_j(Z_j)$ depends on the other factors, a consistent solution must be found by iterating through the factors and replacing each one with its revised estimate. This leads to a coordinate ascent algorithm. It has been shown that convergence is guaranteed in such a setting, as the bound is convex with respect to each factor.

Based on the above, an algorithm for the DPM model can be obtained as below. The model is defined through the stick-breaking construction. The hidden parameters for the model are given by $Z = \{V, \eta, W\}$. The $W$ are the hidden variables that indicate which component each data sample belongs to. $W$ is a one-of-K variable, i.e. a multinomially distributed variable.
Assuming that the component distribution is a multivariate normal, we have $\eta = (\mu, \Lambda)$ where $\mu$ is the mean and $\Lambda$ is the precision matrix for the normal data distribution. The conjugate priors are given by

$$V \sim \mathrm{Beta}(1, \beta_0)$$
$$(\mu, \Lambda) \sim \text{Normal-Wishart}(\mu_0, \kappa_0, \nu_0, \Lambda_0)$$
$$W \sim \mathrm{Multinomial}(\pi(v))$$

(In the univariate case the Normal-Wishart prior reduces to a normal-gamma prior.) Let the solution $q(Z)$ be factorized as

$$q(Z) = \prod_{t=1}^{T} q_{(\alpha_t, \beta_t)}(v_t) \prod_{t=1}^{T} q_{(\mu_t, \kappa_t, \nu_t, \Lambda_t)}(\eta_t) \prod_{n=1}^{N} q_{\phi_n}(w_n)$$

with the subscripts on $q(\cdot)$ indicating the parameters of the distributions. We assume there is one indicator variable $W_n$ per data sample, one $\eta_t$ per component and also one beta variable $V_t$ per component. The variational distribution $q(v_t)$ is a beta distribution with parameters $(\alpha_t, \beta_t)$. The variational distribution $q(\eta_t)$ is a Normal-Wishart distribution with parameters $(\mu_t, \kappa_t, \nu_t, \Lambda_t)$, and $q(w_n)$ is a multinomial distribution with $\phi_n$ as parameters. Treating each $X_n$ as a row vector, define the following quantities.

$$N_t = \sum_n \phi_{nt}, \quad \text{with } \phi_{nt} = E[\mathbb{1}(W_n = t)]$$

$$\bar{x}_t = \frac{\sum_n \phi_{nt} X_n}{N_t}$$

$$S_t = \frac{\sum_n \phi_{nt} (X_n - \bar{x}_t)^T (X_n - \bar{x}_t)}{N_t}$$
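The quantities $N_t$, $\bar{x}_t$ and $S_t$ are responsibility-weighted counts, means and scatter matrices. A minimal pure-Python sketch of how they might be computed is given below; it is not the report's R code. The function name `weighted_statistics` and the nested-list data layout are assumptions, and the sketch assumes every component has $N_t > 0$.

```python
def weighted_statistics(X, phi):
    """Compute N_t, x_bar_t and S_t from data X and responsibilities phi.

    X[n] is the n-th observation (a list of D floats) and phi[n][t] is the
    variational probability that observation n belongs to component t.
    Assumes N_t > 0 for every component.
    """
    n_obs, dim, n_comp = len(X), len(X[0]), len(phi[0])
    counts, means, scatters = [], [], []
    for t in range(n_comp):
        # N_t = sum_n phi_nt
        n_t = sum(phi[n][t] for n in range(n_obs))
        # x_bar_t = sum_n phi_nt X_n / N_t
        mean_t = [
            sum(phi[n][t] * X[n][d] for n in range(n_obs)) / n_t
            for d in range(dim)
        ]
        # S_t = sum_n phi_nt (X_n - x_bar_t)^T (X_n - x_bar_t) / N_t
        scatter_t = [
            [
                sum(
                    phi[n][t] * (X[n][a] - mean_t[a]) * (X[n][b] - mean_t[b])
                    for n in range(n_obs)
                ) / n_t
                for b in range(dim)
            ]
            for a in range(dim)
        ]
        counts.append(n_t)
        means.append(mean_t)
        scatters.append(scatter_t)
    return counts, means, scatters


if __name__ == "__main__":
    X = [[0.0, 0.0], [2.0, 0.0]]
    phi = [[0.5, 0.5], [0.5, 0.5]]
    print(weighted_statistics(X, phi))
```

With hard assignments (each row of `phi` containing a single 1) these reduce to the ordinary per-cluster count, mean and covariance.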

The updates are as follows, with $X_n$, $\bar{x}_t$ and $\mu_0$ treated as row vectors and $D$ denoting the dimensionality of the data.

$$\alpha_t^{new} = 1 + N_t$$

$$\beta_t^{new} = \beta_0 + \sum_n \sum_{j=t+1}^{T} \phi_{nj}$$

$$\kappa_t^{new} = \kappa_0 + N_t$$

$$\mu_t^{new} = \frac{\sum_n \phi_{nt} X_n + \mu_0 \kappa_0}{N_t + \kappa_0}$$

$$\nu_t = \nu_0 + N_t$$

$$(\Lambda_t^{new})^{-1} = \Lambda_0^{-1} + N_t S_t + \frac{\kappa_0 N_t}{\kappa_0 + N_t} (\bar{x}_t - \mu_0)^T (\bar{x}_t - \mu_0)$$

$$\phi_{nt}^{new} \propto \exp\left( \tfrac{1}{2} E(\ln |\Lambda_t|) - \tfrac{1}{2}\left[ \frac{D}{\kappa_t} + \nu_t (X_n - \mu_t) \Lambda_t (X_n - \mu_t)^T \right] + \Psi(\alpha_t) - \Psi(\alpha_t + \beta_t) + \sum_{j=1}^{t-1} \left\{ \Psi(\beta_j) - \Psi(\alpha_j + \beta_j) \right\} \right)$$

where $\Psi$ is the digamma function and the $\phi_{nt}^{new}$ are normalized to sum to one over $t = 1, \dots, T$.

Algorithm 2 Variational approximation for DPM
  Initialization: initialize the parameters of the variational distributions, i.e. initialize α_t, β_t, μ_t, κ_t, ν_t, Λ_t for t = 1, ..., T and φ_n for n = 1, ..., N
  while convergence is not reached do
    update (μ_t, κ_t, ν_t, Λ_t)
    update (α_t, β_t)
    update φ_n
  end while

4. Experiments

The experiments were conducted using a collection of real data sets. The datasets were obtained from the UCI machine learning data repository. Only data sets without any missing values and with numerical attributes were used for the experiments.

Datasets

The Glass dataset consists of 214 instances of multi-attribute data that describe various samples of glass. The number of attributes in the data is 10. Each of the attributes is a continuous numerical value which indicates the amount of a chemical element in the glass instance. The first attribute is an id and can be ignored in the classification tasks. The target attribute is categorical and can take 7 different values indicating various types of glass. Thus it is a 7-class classification problem.

The Iris dataset is a well-known dataset containing three classes of iris plant, with each class containing 50 instances. One of the classes is linearly separable from the others and is hence easily classifiable. The other two classes are not linearly separable and hence present some challenge for the classifier. The attributes are real values which describe various characteristics of the iris plant. This is considered a very simple dataset and is used here as a basic test case.
The Balance dataset contains a set of measurements of weights placed on a balance, and the target class indicates whether the scale is balanced or not. The attributes are integer values which specify the left and right weights and the distances at which they are placed from the center. This is an example of the output being the result of a simple mathematical equation: the correct result can be easily determined from the relation between (left weight * left distance) and (right weight * right distance). However, it is a nonlinear relationship between the attributes and may make the data difficult to learn.

The datasets were used to compare the performance of SVM, J48 (decision trees), MDA (mixture discriminant analysis package)

and NDP. The results are shown below as percentage error averaged over a couple of runs.

dataset/classifier   MDA   J48   SVM   NDP
Glass
Iris
Balance

As expected, Iris was the easiest dataset to classify. Glass is hard for all the classifiers, and they all show similar performance on it. The interesting result with the Balance dataset is the very poor performance shown by NDP. It is not clear whether this is due to an incorrect implementation or to model assumptions that do not match the data. But since MDA makes similar assumptions about the data (that the data is composed of a mixture of Gaussians), I am inclined to think that the implementation of NDP was not correct. The next subsection shows that the NDP/DPM classifier has a problem with datasets whose components have small separation. Future work will investigate this problem.

Comparing variational approximation with DPM

The variational approximation algorithm was implemented in the R software package. However, only the univariate case works correctly, and the multivariate code is still a work in progress. (The code has an unknown defect, and the final approximation degenerates to a single Gaussian irrespective of the data distribution.) The goodness of approximation was verified using visual plots. The data was generated from an equiprobable mixture of normals with means separated by a variable gap and with small variance for each component: the nth component has mean n*gap and variance n. The visual comparison between the original distribution and the predictive distributions of the DPM and variational approximations is shown in the figures. The original distribution is shown in green, the variational distribution in red, and the DPM-based estimate in blue. The actual data points are shown at the bottom of each plot twice, with colors indicating the clusters detected by the DPM and variational algorithms. The upper row is for the variational clusters and the lower row for the DPM clusters.

Figures 1-5. Variational approximation of Gaussian mixtures with various component separations.

In figure 1, with gap=5, it can be seen that the DPM predictive distribution does not match the actual distribution that well. The blue curve misses two components in the data (it has only two modes). The data points for the higher components (points to the right) are a bit closer to each other, and DPM considers them to be in a single component. The clustering shown by the lower colored markings gives the DPM prediction for each data point. There are only two colors, implying that only two components are detected by DPM. The variational distribution detects all the components, though its estimates of the means and variances of the components are a bit off. A similar thing can be observed for gap [...]. From figure 3 onwards, DPM does a much better job of estimating the data distribution. The variational approximation seems to do well, but it shows a bad approximation for both gap=20 and gap=40. This brings up one of the issues with the variational algorithm: it has a tendency to get stuck in local maxima. Blei and Jordan [2] warn of this issue and suggest that the initialization of the variational distribution parameters requires some care to avoid local maxima. Their recommendation is to initialize the variational distribution by incrementally updating the parameters according to a random permutation of the data points. Even then, multiple runs must be done and the parameter settings that give the best bound selected. This randomization step was not performed in these experiments, and that explains these odd fits.
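The synthetic data used in this comparison can be reproduced approximately from the description above. The sketch below is an illustration in Python rather than the R code used for the experiments. The function names are hypothetical, and the number of components is left as a parameter because the report does not state it explicitly; the value 4 in the usage example is only an assumption.

```python
import math
import random


def sample_mixture(n_points, n_components, gap, seed=0):
    """Sample from an equiprobable mixture of univariate normals.

    Component k (k = 1, ..., n_components) has mean k * gap and variance k.
    Returns the data points and the component label of each point.
    """
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n_points):
        k = rng.randint(1, n_components)  # every component is equally likely
        data.append(rng.gauss(k * gap, math.sqrt(k)))  # std dev = sqrt(variance)
        labels.append(k)
    return data, labels


def mixture_density(x, n_components, gap):
    """True mixture density at x, for plotting against the fitted estimates."""
    total = 0.0
    for k in range(1, n_components + 1):
        variance = float(k)
        total += math.exp(-((x - k * gap) ** 2) / (2.0 * variance)) / math.sqrt(
            2.0 * math.pi * variance
        )
    return total / n_components


if __name__ == "__main__":
    for gap in (5.0, 20.0, 40.0):
        data, labels = sample_mixture(200, 4, gap)
        print(gap, round(min(data), 2), round(max(data), 2))
```

Because the variance grows with the component index while the spacing stays fixed, the higher components overlap more than the lower ones at small gaps, which matches the observation that the rightmost components are the hardest to separate.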

The Dirichlet models also show strange behavior when the component separations are small. As the gap increases, the fit becomes much better and closely follows the original distribution. As indicated in the previous section, this could be an implementation issue rather than a model restriction, and it must be investigated further. While this observation is about the DPM model, it also applies to the NDP model, as they share the implementation.

5. Conclusions and Future work

This report described a comparison of the NDP model with other classifiers and a preliminary analysis of the effectiveness of the variational approximation algorithm. The learning with respect to the DPM and NDP models was that there were either hidden implementation defects or model assumptions that cause them to classify certain datasets poorly. It is observed that whenever the components of the data are close (as was the case with the balance-scale dataset), NDP and DPM are unable to distinguish the components, causing misclassifications. This is unlikely to be a restriction of the model, as the MDA algorithm has a very similar model and yet yields better results. DPM models should be able to perform as well as or better than MDA models. The anomaly observed is likely to be an implementation issue and will be investigated further. (I have already spent some time on this issue, with no result yet.)

The task of implementing the variational approximation involved studying various methods and then deriving the mathematical model for a specific setting that assumes a normal data distribution. Implementing the variational approximation for a DPM model turned out to be a lot harder than it looked. While it is a simple coordinate ascent algorithm, the mathematics involved is a bit tedious and the algorithm is sensitive to the choice of priors. There is also always a chance that the variational distribution gets stuck in a local maximum. All these factors made the work very hard.
The multivariate approximation was implemented, but it does not yield appropriate results due to possible implementation or modeling errors. Future work will be to correct this issue. Overall it was observed that the variational algorithm converges faster and can also yield good results when it converges to the right maximum. The variational approximations seem to work well when they converge, but the algorithm still needs to be improved so that it always converges to a global maximum.

References

[1] C.E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Annals of Statistics, 2.
[2] D.M. Blei and M.I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1).
[3] M.D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90.
[4] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1.
[5] A.E. Gelfand. Gibbs sampling. Journal of the American Statistical Association, 95(452).
[6] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109.
[7] T.S. Jaakkola and M.I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25-37.
[8] M.I. Jordan. Learning in graphical models. Kluwer Academic Publishers.

[9] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, et al. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087.
[10] R.M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2).
[11] A. Rodriguez, D.B. Dunson, and A.E. Gelfand. The nested Dirichlet process. Journal of the American Statistical Association, submitted.
[12] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4.
[13] A.F.M. Smith and G.O. Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B (Methodological), 55(1):3-23.
