TWO MODELS INVOLVING BAYESIAN NONPARAMETRIC TECHNIQUES


TWO MODELS INVOLVING BAYESIAN NONPARAMETRIC TECHNIQUES

By SUBHAJIT SENGUPTA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2013

© 2013 Subhajit Sengupta

This dissertation is dedicated to my Maa and Baba for their endless support and love.

ACKNOWLEDGMENTS

After completing a wonderful voyage of six and a half years, I would like to take this opportunity to thank all the people who accompanied me on this memorable long journey. Without their support and encouragement, it would have been impossible for me to reach this point. First of all, I would like to thank my advisors and mentors, Dr. Arunava Banerjee and Dr. Jeffrey Ho, without whom I would not be here writing my dissertation in the first place. This is probably the first time I am being so formal as to thank them. I owe them a debt of gratitude for their patience, immense support, kindness, and belief in me. Arunava is the person who instilled in me the desire to do meaningful research. He has amazing motivational prowess. There is no way I can finish talking about him within a paragraph. I am especially thankful to Jeff for being so generous with his time over the last two and a half years. We had countless interesting and exciting discussions, which I will miss the most once I am no longer in Gainesville. His appetite for knowledge should be a dream for every young researcher. I feel really blessed to have Arunava and Jeff as my teachers, from whom I will never finish learning in my entire life. I would like to thank all the other members of my committee, Dr. Paul Gader, Dr. Alireza Entezari, and Dr. Malay Ghosh, for spending their invaluable time on numerous helpful discussions. I would also like to thank Dr. Arunava Banerjee, Dr. Jeffrey Ho, Dr. Anand Rangarajan, Dr. Meera Sitharam, and Dr. Paul Gader for their excellent courses in the Computer Science department. Outside our department, I would like to thank Dr. Rosalsky, Dr. Robinson, Dr. Ghosh, Dr. Doss, Dr. Presnell, and Dr. Hobert for their wonderful mathematics and statistics courses. I am also very thankful to Dr. Donald Richards from Penn State University for insightful communications over email related to the first problem that we will discuss in this dissertation. I was partially supported by a grant (IIS) from the National Science Foundation to Arunava, and by a research assistantship under Jeff in 2011, which I gratefully acknowledge. I am also very thankful to the CISE Department for supporting me

through teaching assistantships (TA) over the years. It has been a great privilege to spend several years in the CISE department at the University of Florida. These moments will always remain dear to me. I am thankful to John Bowers and Joan Crisman for all their help with administrative issues. I am very grateful to the Wikimedia Foundation for their wonderful effort of creating a worldwide knowledge resource for every single human being. I would like to thank all of my housemates over the years and all my friends in the US and other parts of the world. They were always a source of joy, laughter, and support. I'd like to give special thanks to my friends Kiranmoy Das and Subhadip Pal, with whom I spent a lot of time discussing many interesting statistical problems. I would like to thank all of my lab-mates: Karthik Gurumoothy, Venkatakrishnan Ramaswami, Ajit Rajwade, John Corring, Mohsen Ali, Jason Chi, Manu Sethi, Shahed Nejhum, Nathan Vanderkraats, and Neko Fisher. I had a wonderful time in the lab enjoying countless interesting conversations with all of you. I know that I have left out the names of many of my really good friends, but you know who you are! Finally, I am thankful to my loving and caring maa (mother) and baba (father), who have always believed in me, supported me every minute, taught me the values of being a disciplined and responsible person, and helped me finish one of the exciting and enjoyable battles of my life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   Problem Statements
   Previous Related Work
   Organization of the Dissertation

2 BAYESIAN INFERENCE AND NONPARAMETRIC BAYESIAN FRAMEWORK
   Bayesian Theory
   MAP Estimate
   Conjugate Prior
   Nonparametric Bayesian
   Motivation and Theoretical Background
   Exchangeability
   De Finetti Theorem
   Dirichlet Distribution and Dirichlet Process
   Posterior Distribution for DP
   Polya's Urn Scheme or CRP
   Stick Breaking Representation of DP
   Dirichlet Process Mixture Model (DPMM)
   Beta Process (BP)
   Completely Random Measure (CRM)
   Another Viewpoint of CRM
   BP
   Bernoulli Process (BeP) and Indian Buffet Process (IBP)
   Connection to IBP
   BP in a Nutshell
   Markov Chain Monte Carlo Sampling
   Metropolis-Hastings (MH) and Gibbs Sampling
   Rejection Sampling Method
   Adaptive Rejection Sampling (ARS)
   MCMC Sampling Techniques for DPMM
   Slice Sampling
   Efficient Slice Sampling
   Variational Bayes (VB) Inference
   Approximate Inference Procedure
   KL-Divergence

3 GEOMETRIC AND STATISTICAL PROPERTIES OF STIEFEL MANIFOLD
   Geometric Properties of Stiefel Manifold
   Analytic Manifold
   Stiefel Manifold
   Group Action
   Tangent and Normal Space of Stiefel Manifold
   Statistical Properties of Stiefel Manifold
   Probability Distribution on Stiefel Manifold
   Properties of Matrix Langevin Distribution
   Computation of the Hypergeometric Function of a Matrix Argument
   Sampling Random Matrices from Matrix Langevin Distribution on V_{n,p}
   The Rejection Sampling Method
   Gibbs Sampling Method

4 BAYESIAN ANALYSIS OF MATRIX-LANGEVIN ON THE STIEFEL MANIFOLD
   Preliminaries
   Motivating Example: Dictionary Learning
   The Stiefel Manifold and ML Distribution
   Parametric Bayesian Inference for the ML Distribution
   Likelihood for the ML Distribution
   Prior for the Polar Part M
   Posterior for the Polar Part M
   Prior for the Elliptical or Concentration Part K
   Upper and Lower Bounds for the 0F1(.) Function
   A Lower Bound
   Lower Bounds for I_0(x)
   Remarks
   Lower Bound for 0F1(.) Using Lower Bound for I_0(x)
   Posterior for the Elliptical or Concentration Part D
   Rejection Sampling
   Metropolis-Hastings (MH) Sampling Scheme for D
   Hybrid Gibbs Sampling
   Experiments on Simulated Data
   Extension of the Model to a More General K
   Log-convexity of the Hypergeometric Function
   A Solution: Possible ARS Sampling
   Finite Mixture Modeling
   Infinite Mixture Modeling
   DPM Modeling on the Stiefel Manifold
   MCMC Inference Scheme
   Variational Bayes Inference (VB) on Stiefel Manifold
   Matrix-Langevin Distributions
   Update Equation for γ_t
   Update Equation for τ_t
   CG for Minimizing F(τ) on the Stiefel Manifold
   Update Equation for φ_{n,t}
   Calculated KL-Divergence
   Experiments
   Experiments on Synthetic Data
   Categorization of Objects
   Classification of Outdoor Scenes

5 BETA-DIRICHLET PROCESS AND CATEGORICAL INDIAN BUFFET PROCESS
   Multivariate Liouville Distributions
   Beta-Dirichlet (BD) Distribution
   Normalization Constant by Liouville Extension of Dirichlet Integral
   BD Distribution Conjugacy
   With Multinomial Likelihood
   With Negative Multinomial Likelihood
   Completely Random Measure (CRM) Representation
   Another Viewpoint for CRM
   Campbell's Theorem
   Beta Dirichlet Process
   BP Construction by Taking Limit from Discrete Case
   Construction of BDP
   Multivariate CRM (MCRM) Representation of BDP
   Beta-Dirichlet Process as a Poisson Process
   A Size-biased Construction for Levy Representation of BDP
   BD-Categorical Process Conjugacy
   Categorical Process (CaP)
   Conjugacy for CaP and BDP: CRM Formulation
   BD-CaP Conjugacy With Standard Parametrization
   BD-CaP Conjugacy Using Alternative Parametrization for BDP in the Base Measure G
   BD-CaP Conjugacy: Proof Statement
   Extension of Indian Buffet Process
   Extension of Finite Feature Model and the Limiting Case
   BD Process (BDP) and Categorical Indian Buffet Process (cIBP)
   Symmetric Dirichlet
   Asymmetric Dirichlet Connection
   BD-NM Conjugacy
   Negative Multinomial Process (NMP)
   Conjugacy for NMP and BDP: CRM Formulation
   Formal Proof of Conjugacy of NMP for BDP
   Prior Part
   Induced Measure
   Marginal Distribution of Ȳ Via Marked Poisson Process
   Checking the Integration
   The Case When ν =
   Beta-Dirichlet-Negative Multinomial Process as a Marked Poisson Process
   Experiment with Simulated Data and Results
   Synthetic Data
   Inference for BDNM Model
   Negative Multinomial Likelihood is Conjugate to Beta-Dirichlet Prior
   Negative Multinomial as a Mixture of Gamma and Multivariate Independent Poisson (MIP)
   Posterior Inference with Finite Approximate Gibbs Sampler
   BD Draws
   Negative Multinomial Draws
   Gamma-Poisson Conjugacy
   Inference Steps
   Sampling z_{d,n}, y_{d,n} and c_{d,n}
   Sampling Ā_{d,k}
   Sampling b_{0,k}
   Sampling ω_k

6 DISCUSSION AND FUTURE WORK
   Future Work Related to the First Problem
   Future Work Related to the Second Problem

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1  Results for the synthetic data set
4-2  Actual and estimated number of clusters and accuracy with real data having different numbers of clusters

LIST OF FIGURES

2-1  The space of data $\mathcal{X}$ and the space of all measures $\mathcal{M}(\mathcal{X})$
2-2  Partition of $\mathcal{X}$
2-3  CRP representation of DP
2-4  SB construction for DP
2-5  Graphical model for DPMM
2-6  CRM alternative view
2-7  BP in a nutshell
2-8  Rejection sampling method
3-1  The tangent and normal spaces of an embedded manifold
4-1  Lower bound for $_0F_1(a; S)$ [in red] by the RHS of Equation 4-14 [in blue]; the x-axis represents the sum of eigenvalues and the y-axis denotes the function values
4-2  Lower bound for $I_0(x)$ [in red] by $\exp(x - 0.77)$ [in blue]. Note that the inequality $I_0(x) > \exp(x - 0.77)$ holds only in the interval [0, 1]
4-3  An approximate profile of the posterior density function for a 2x2 diagonal matrix when 100 data points are given
4-4  Graphical model for variational inference of DPM
4-5  Log marginal probability of the data increases with the number of iterations
4-6  Confusion matrix for all of the simulated data sets
4-7  Selected 6 object categories from the ETH-80 data set
4-8  Confusion matrix for the ETH-80 data set
4-9  Selected 3 scene categories from the 8-Scene data set
4-10 Confusion matrix for the Outdoor Scene data set
5-1  (Left) Poisson process Π on $[0, \tau] \times S_o^2$ with mean measure ν = h × µ. The set V contains a Poisson-distributed number of atoms with parameter $\int_S h(d\omega)\mu(dp)$. (Right) One draw from the BD constructed from Π. The first dimension is the location and the other dimensions constitute the weight vector
5-2  BD-Categorical process with Q =
5-3  A candidate matrix with Q = 3 has 4 categories, namely 0, 1, 2, or 3
5-4  A candidate matrix with Q = 2 has 3 categories, namely 0, 1, or 2
5-5  BD-Negative Multinomial process with Q =
5-6  The hierarchical BP-BD-NM process with K = 3 and Q =
5-7  Top 6 topics and their top 20 words

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TWO MODELS INVOLVING BAYESIAN NONPARAMETRIC TECHNIQUES

Chair: Arunava Banerjee
Major: Computer Engineering

By Subhajit Sengupta
May 2013

Deciphering latent structure in data is one of the fundamental challenges that the machine learning community has been grappling with in recent years. Developments in nonparametric Bayesian theory have helped researchers in machine learning to build successively better inference techniques to apply to real-world problems. In particular, the Dirichlet process mixture model (DPMM) has been exploited extensively for unsupervised clustering of data. Although inference techniques have been studied on different analytic manifolds, nonparametric modeling techniques on general analytic manifolds have not been thoroughly explored so far; there has been only limited work on particular special manifolds like the Stiefel and the Grassmann manifolds. In the first problem that we discuss in this dissertation, we present a Dirichlet process mixture model framework on the Stiefel manifold to automatically discover the number and membership of clusters. Solutions for both of the standard approaches to inference, Markov chain Monte Carlo as well as variational inference, are discussed. This novel approach to discovering the hidden structure in the data successfully combines directional statistics, the geometric structure of the space, and Bayesian learning. In support of the theoretical results, some real-world data sets as well as some synthetic data sets are clustered using our algorithm, and satisfactory performance is demonstrated. The second problem that we discuss in this dissertation concerns the latent feature model, which has been extensively used in various machine learning applications. One of the important advances in latent feature modeling is the development of a stochastic process called the

Indian Buffet Process (IBP). It defines a prior probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. Researchers have also shown the connection between the IBP and an underlying stochastic process called the Beta process (BP). In this part of the dissertation, we present an extension to the IBP that removes the binary constraint on the elements of the matrix. We show how this new process, called the categorical IBP, or cIBP, is related to another underlying stochastic process called the Beta-Dirichlet process (BDP), and discuss its properties. Finally, we present the necessary conjugacy results and inference techniques for a proposed application to topic discovery.

CHAPTER 1
INTRODUCTION

1.1 Problem Statements

This dissertation can, by and large, be divided into two pieces, with Bayesian nonparametrics providing the common thread. The first concerns the application of Bayesian nonparametric inference to data lying in non-Euclidean spaces. The second concerns the application of Bayesian nonparametric inference to latent feature models of data. Each piece is described in turn in the following paragraphs.

In machine learning, the collected data often do not lie in a Euclidean space. For example, data corresponding to orientations lie on a unit hyper-sphere. In the general case, data may lie on some manifold with non-zero curvature. One such compact matrix manifold is known as the Stiefel manifold. It is the space of ordered sets of p orthonormal vectors, or p-frames, in $\mathbb{R}^n$, and can be seen as a generalization of orientation data. Standard probability distributions cannot be used to describe such orientations; we have to use a probability distribution that is defined on the manifold. In some situations we are required to cluster these orientation matrices; in [1], we find one such modeling requirement. In the first segment of this dissertation we present a Bayesian nonparametric model on the Stiefel manifold. We seek to cluster data points on the manifold itself. Clustering is a well-known problem in unsupervised machine learning, where the goal is to group similar objects together. Numerous methods have been proposed to solve the clustering problem; K-means, mixtures of Gaussians, and EM are some of the more popular ones. In the context of clustering, nonparametric modeling is particularly appropriate since we might not have prior knowledge of the number of clusters. By applying Bayesian nonparametric techniques one can potentially avoid the model selection problem; the inference techniques are able to find the right number of clusters from the data. Here we formulate and solve a novel clustering model on the Stiefel manifold by first developing theory for the case where the number of clusters is known beforehand, and then augmenting it with theory borrowed from Bayesian nonparametrics.

The second problem that we tackle in this dissertation is based on the latent feature model of data. In this model, each object or data point is represented by a vector of latent feature values, and the data can be generated from a distribution on the latent feature vector. A generalization to infinitely many features is achieved by the Indian Buffet process (IBP), which defines a prior on the space of binary matrices that indicate the possession of particular features by each object, with the number of columns in the matrix (corresponding to features) being potentially unbounded. The objects are exchangeable, and therefore, according to De Finetti's theorem, there exists a random probability measure such that the objects are conditionally independent once this measure is given. In [2] the authors have shown that the Beta process is the underlying De Finetti measure for the Indian Buffet process (IBP). Here we show that the Beta-Dirichlet (BD) process [3] is the underlying De Finetti measure for a generalization of the IBP, a stochastic process which we call the categorical IBP (cIBP). Stated informally, this dissertation shows that the BD process plays the role for the cIBP that the Dirichlet process (DP) and the Beta process (BP) play for the Chinese Restaurant Process (CRP) and the traditional IBP, respectively. Ours is a natural extension or generalization of the traditional IBP where each entry is a categorical random variable instead of a Bernoulli random variable. We have developed a hierarchical Bayesian framework with the BD process and use this connection to develop an efficient inference scheme for the cIBP. Finally, we present an application in topic modeling using this new model.

1.2 Previous Related Work

Researchers in differential geometry have been studying the geometric properties of general Riemannian manifolds; a very good introduction can be found in [4], [5]. At the same time, many researchers interested in the statistical standpoint have explored the interesting statistical properties that those same manifolds have. A comprehensive review can be found in [6]. The authors in [7] have explored optimization techniques on the Stiefel manifold, that is, techniques in the presence of orthogonality constraints. A survey of different optimization techniques involving matrix manifolds can be found in [8]. Shape analysis is an area where statistics

on manifolds has various applications. Recently, nonparametric Bayesian density estimation has been applied to planar shape study, where the authors have identified the planar shape space with the complex projective space, which is the space of all complex lines passing through the origin in an appropriate complex plane [9]. Spatiotemporal dynamical models can also be studied on these special manifolds [10]. Image and video based recognition problems have been formulated in this framework. Recently, an intrinsic mean-shift algorithm has been developed on both the Stiefel and Grassmann manifolds, which has found application in object categorization and motion segmentation. Classification techniques on Riemannian manifolds have also been described in a variety of research in computer vision. For the latent feature model, a good review article on the IBP can be found in [11]. The connection between the IBP and the Beta process can be found in [2]. [12] and [13] are two other interesting works related to the IBP. Concepts from Lévy processes and Poisson processes are used in various proofs; [14], [15] and [16], [17] are excellent references for Lévy and Poisson processes, respectively. Hjort's [18] and Kim's [19] work on the development of nonparametric Bayesian techniques in the context of survival analysis are among the important advancements in this field. [3] is the major motivation for our work in extending the IBP prior to the categorical setting. Within the machine learning community, [20] is an interesting recent related work.

1.3 Organization of the Dissertation

In chapter 2, we discuss the Bayesian framework, and in particular nonparametric Bayesian modeling theory with various existing inference techniques. Chapter 3 contains the geometric and statistical properties of the Stiefel manifold. In chapter 4 we discuss the first problem, Bayesian analysis with the Matrix Langevin distribution on the Stiefel manifold; both Markov chain Monte Carlo (MCMC) and variational Bayes inference techniques are formulated, and experimental results are presented at the end of the chapter. Chapter 5 is about a new latent feature model related to the Beta-Dirichlet (BD) process. We show conjugacy of the negative multinomial process with the BD process. We also develop a hierarchical model, apply it to a synthetic data set in the domain of topic discovery, and show experimental results.

At the end of that chapter, we explicitly show the connection between the cIBP and the BDP, where the latter is the De Finetti mixing distribution of the former. Finally, in chapter 6, we conclude with our plan for future work.

CHAPTER 2
BAYESIAN INFERENCE AND NONPARAMETRIC BAYESIAN FRAMEWORK

2.1 Bayesian Theory

Bayesian inference is one of the most important techniques in mathematical statistics. In the basic Bayesian framework, a prior distribution Π for the parameter Θ reflects one's prior knowledge regarding Θ. The prior is updated by observing the data $X_1, X_2, \dots, X_n$, which are modeled as independent and identically distributed (i.i.d.) from $p_\theta$ given Θ. The updated distribution for Θ based on the data is called the posterior distribution for Θ and is obtained via Bayes' rule: the posterior probability is derived from the prior probability and the likelihood function. The posterior, like the prior, is a probability measure on the parameter space of Θ, and it depends on $X_1, X_2, \dots, X_n$. It is also relatively easy to find the predictive distribution, which is used to calculate various statistics for future observations. Let us write down the formal description of Bayesian theory:

- $X = [X_1, X_2, \dots, X_n]$ are the n observed data points.
- Θ is the parameter.
- β is the hyper-parameter for the distribution of the prior.
- $X_{new}$ is a new data point whose predictive distribution needs to be computed.

The prior distribution for Θ is denoted by $p(\Theta \mid \beta)$. The likelihood is denoted by $p(X \mid \Theta) = \prod_{i=1}^{n} p(x_i \mid \Theta)$, as the data are drawn i.i.d. from $p_\theta$. The posterior distribution of Θ is given by

$$p(\Theta \mid X, \beta) = \frac{p(X \mid \Theta)\, p(\Theta \mid \beta)}{p(X \mid \beta)} = \frac{p(X \mid \Theta)\, p(\Theta \mid \beta)}{\int_{\Theta} p(X \mid \Theta)\, p(\Theta \mid \beta)\, d\Theta}$$

where $p(X \mid \beta)$ is called the marginal likelihood. The posterior predictive distribution is the distribution of a new data point $X_{new}$, marginalized over the posterior distribution, and is given by

$$p(X_{new} \mid X, \beta) = \int_{\Theta} p(X_{new} \mid \Theta)\, p(\Theta \mid X, \beta)\, d\Theta$$

2.1.1 MAP Estimate

In Bayesian statistics [21], the mode of the posterior distribution is known as the maximum a posteriori (MAP) estimate. MAP is used to obtain a point estimate of the parameter based on the observed data. This estimation procedure can be seen as a regularized maximum likelihood estimation; it can also be seen as an optimization procedure over the parameter space where the parameters now have a prior distribution. The MAP estimate is given by

$$\hat{\theta}_{MAP} = \arg\max_{\Theta}\, p(\Theta \mid X, \beta)$$

When the mode of the posterior distribution can be written in a closed analytical form, computation of the MAP estimate is easier. This brings in the notion of a conjugate prior.

Conjugate Prior

When the posterior distribution $p(\Theta \mid X, \beta)$ is in the same family as the prior probability distribution $p(\Theta \mid \beta)$, the prior and posterior are called conjugate to each other, and the prior is called a conjugate prior for the likelihood $p(X \mid \Theta)$. A conjugate prior is an algebraic convenience that allows the posterior distribution to be expressed in closed form. However, it is not a good idea to simply choose any prior distribution that works analytically; the conjugate prior distribution should be chosen such that it adequately describes the investigator's knowledge of the unknown parameter(s) before any data are obtained. For example, the Beta distribution is the conjugate prior for a Bernoulli or binomial likelihood, and the Dirichlet distribution is the conjugate prior for a multinomial likelihood. In fact, if the likelihood function belongs to the exponential family, then there always exists a corresponding conjugate prior distribution.
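As a concrete illustration of conjugacy and the MAP estimate (not part of the original text; the Beta-Bernoulli pair and the function name below are chosen only as a minimal example), the following sketch performs the conjugate update and reads off the posterior mode:

```python
import numpy as np

def beta_bernoulli_posterior(x, a=1.0, b=1.0):
    """Conjugate update: Beta(a, b) prior with a Bernoulli likelihood.

    x : array of 0/1 observations.
    Returns the posterior parameters and the MAP estimate of theta.
    """
    x = np.asarray(x)
    a_post = a + x.sum()             # a + number of successes
    b_post = b + len(x) - x.sum()    # b + number of failures
    # Posterior mode (MAP); well defined when a_post, b_post > 1.
    theta_map = (a_post - 1.0) / (a_post + b_post - 2.0)
    return a_post, b_post, theta_map

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.binomial(1, 0.3, size=50)   # simulated coin flips
    print(beta_bernoulli_posterior(data, a=2.0, b=2.0))
```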

2.2 Nonparametric Bayesian

In parametric Bayesian theory, the form of the distribution for each class of data is assumed. This is in some ways very restrictive as well as inflexible, and the number of parameters does not depend on the sample size. A Bayesian nonparametric model, on the other hand, is a Bayesian model on an infinite-dimensional parameter space. Model complexity is often represented by the number of parameters; in the case of Bayesian nonparametric [22] modeling, the effective model complexity adapts to the data, and hence it is very flexible. In machine learning research, Bayesian nonparametric models have recently been studied and applied to a variety of problems that include clustering, classification, regression, density estimation, image segmentation, document processing, and topic modeling.

Motivation and Theoretical Background

Most recent machine learning problems deal with uncovering the patterns or structure of the data. One of the important problems is unsupervised learning, where we need to find the appropriate set of parameters for a model class without any training examples. A major challenge of machine learning is to identify the model classes, with suitable parameters, that are responsible for generating the data; this is the so-called model selection problem. To be precise, in the setting of unsupervised learning, it is the number of unknown clusters, for example finding the unknown number of states in a hidden Markov model. Firstly, we would like to point out that nonparametric actually means an unbounded number of parameters: we use an infinite-dimensional parameter space and use only a finite subset of the available parameters for any given finite data set. This subset generally grows with the data set. Ferguson (1974) [23] was the first to explore the Bayesian approach to nonparametric problems. In one of his seminal papers he mentioned two desirable properties for a prior distribution for nonparametric problems:

- The support of the prior distribution should be large with respect to a suitable topology on the space of probability distributions on the sample space.
- The posterior distribution given a sample from the true probability distribution should be manageable analytically.

The Bayesian prior is formulated as a distribution on the space of probability distributions, and a distribution on the space of probability distributions (or probability measures) is a stochastic process. In parametric Bayesian modeling, the form of the distribution for each class is assumed; in nonparametric Bayesian modeling, the parameter space is the set of all probability measures on $\mathcal{X}$, denoted $\mathcal{M}(\mathcal{X})$.

The Bayesian prior is then formulated as a distribution on the space $\mathcal{M}(\mathcal{X})$.

Figure 2-1. The space of data $\mathcal{X}$ and the space of all measures $\mathcal{M}(\mathcal{X})$

Exchangeability

A sequence of random variables $X_1, \dots, X_n$ over the same probability space $(\mathcal{X}, \mathcal{B}(\mathcal{X}), \mu)$ is called exchangeable if the joint distribution of those random variables is invariant to permutation. If P is the joint distribution and σ is any permutation of $\{1, 2, \dots, n\}$, then invariance under permutation gives

$$P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = P(X_1 = x_{\sigma(1)}, X_2 = x_{\sigma(2)}, \dots, X_n = x_{\sigma(n)})$$

An infinite sequence of random variables is infinitely exchangeable if $X_1, X_2, \dots, X_n$ are exchangeable for all $n \geq 1$. This assumption is very common in applications of machine learning and applied statistics, and it is a much weaker assumption than assuming i.i.d. random variables: clearly, i.i.d. implies exchangeability, but the converse is not true.

De Finetti Theorem

If $(X_1, X_2, \dots, X_n)$ are infinitely exchangeable, then the joint probability $P(X_1, X_2, \dots, X_n)$ has a representation as a mixture over some random variable θ:

$$P(X_1, X_2, \dots, X_n) = \int P(\theta) \prod_{i=1}^{n} P(X_i \mid \theta)\, d\theta$$

If we assume a prior on the underlying random latent parameter θ, the data are conditionally i.i.d. given the latent parameter. In the case of the Dirichlet process, P(θ) is a distribution on the

space of probability measures. So the De Finetti theorem implicitly defines the stochastic process underlying a Bayesian nonparametric model. This general version of De Finetti's result was proved by Hewitt and Savage (1955) [24].

Dirichlet Distribution and Dirichlet Process

The Dirichlet distribution is a multivariate generalization of the Beta distribution. It is a distribution over the K-dimensional probability simplex. Let $p = (p_1, p_2, \dots, p_K)$ be a K-dimensional probability vector, such that $p_i \geq 0$ for all i and $\sum_{i=1}^{K} p_i = 1$. The Dirichlet distribution with parameter $\alpha = (\alpha_1, \dots, \alpha_K)$ is written $\mathrm{Prob}(p \mid \alpha) = \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$, where

$$\mathrm{Dir}(\alpha_1, \dots, \alpha_K) = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} p_i^{\alpha_i - 1}$$

with

$$\mathrm{E}[p_j] = \frac{\alpha_j}{\sum_{i=1}^{K} \alpha_i}, \qquad \mathrm{Var}[p_j] = \frac{\alpha_j\!\left(\sum_{i=1}^{K} \alpha_i - \alpha_j\right)}{\left(\sum_{i=1}^{K} \alpha_i\right)^2 \left(\sum_{i=1}^{K} \alpha_i + 1\right)}, \qquad \mathrm{Cov}(p_r, p_s) = \frac{-\alpha_r \alpha_s}{\left(\sum_{i=1}^{K} \alpha_i\right)^2 \left(\sum_{i=1}^{K} \alpha_i + 1\right)} \;\text{ for } r \neq s$$

One special case of this distribution is the symmetric Dirichlet distribution, where all the parameters $\alpha_i$ are equal to α. The Dirichlet distribution is the conjugate prior distribution of the categorical distribution and the multinomial distribution: if the data likelihood follows a categorical or multinomial distribution and the prior follows a Dirichlet distribution, then the posterior is again Dirichlet. The Dirichlet process is the infinite generalization of the Dirichlet distribution. A Dirichlet process (DP) [25] is a random probability measure G over $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ such that for any finite measurable partition $\mathcal{X} = A_1 \cup A_2 \cup \dots \cup A_N$,

$$(G(A_1), G(A_2), \dots, G(A_N)) \sim \mathrm{Dir}(\alpha(A_1), \alpha(A_2), \dots, \alpha(A_N))$$

where α is the base measure. A DP is thus a distribution over probability measures such that the marginals on finite partitions are Dirichlet distributed. This immediately leads to the question of the existence of a Dirichlet process; in other words, how to construct the stochastic process from this family of marginal distributions. The answer is the Kolmogorov consistency (or extension) theorem. Here is the formal statement of the theorem [26]:

Figure 2-2. Partition of $\mathcal{X}$

Proposition 2.1 (Consistency theorem): Let T denote some time interval and let $n \in \mathbb{N}$. For each $k \in \mathbb{N}$ and finite sequence of times $t_1, t_2, \dots, t_k \in T$, let $\nu_{t_1, t_2, \dots, t_k}$ be a probability measure on $(\mathbb{R}^n)^k$. Suppose that these measures satisfy the following two consistency conditions:

- For all permutations σ of $\{1, 2, \dots, k\}$ and measurable sets $F_i \subseteq \mathbb{R}^n$,
$$\nu_{t_{\sigma(1)}, t_{\sigma(2)}, \dots, t_{\sigma(k)}}(F_{\sigma(1)} \times F_{\sigma(2)} \times \dots \times F_{\sigma(k)}) = \nu_{t_1, t_2, \dots, t_k}(F_1 \times F_2 \times \dots \times F_k)$$
- For all measurable sets $F_i \subseteq \mathbb{R}^n$ and $m \in \mathbb{N}$,
$$\nu_{t_1, t_2, \dots, t_k}(F_1 \times F_2 \times \dots \times F_k) = \nu_{t_1, t_2, \dots, t_k, t_{k+1}, \dots, t_{k+m}}(F_1 \times F_2 \times \dots \times F_k \times \underbrace{\mathbb{R}^n \times \dots \times \mathbb{R}^n}_{m \text{ times}})$$

Then there exists a probability space $(\Omega, \mathcal{F}, P)$ and a stochastic process $X : T \times \Omega \to \mathbb{R}^n$ such that

$$\nu_{t_1, t_2, \dots, t_k}(F_1 \times F_2 \times \dots \times F_k) = P(X_{t_1} \in F_1, X_{t_2} \in F_2, \dots, X_{t_k} \in F_k)$$

for all $t_i \in T$, $k \in \mathbb{N}$, and measurable sets $F_i \subseteq \mathbb{R}^n$; i.e., X has $\nu_{t_1, \dots, t_k}$ as its finite-dimensional distributions relative to times $t_1, \dots, t_k$. In other words, given the above consistency conditions, there exists a (unique) measure ν on $(\mathbb{R}^n)^T$ with marginals $\nu_{t_1, \dots, t_k}$ for any finite collection of times $t_1, \dots, t_k$. This theorem can be used to prove the existence of the DP. A DP G has two parameters:

- The base distribution H, which is like the expectation of the DP.

- The concentration parameter α, which is like the inverse of the variance of the DP.

The expectation and variance of the DP are $\mathrm{E}[G(A)] = H(A)$ and $\mathrm{V}[G(A)] = \frac{H(A)(1 - H(A))}{\alpha + 1}$, respectively, where A is any measurable subset of $\mathcal{X}$.

Posterior Distribution for DP

Let $G \sim \mathrm{DP}(\alpha, H)$. Since G is a random distribution, let $\theta_1, \theta_2, \dots, \theta_n$ be draws from G; we want the posterior distribution of G given the observed values of $\theta_1, \theta_2, \dots, \theta_n$. Let $A_1, A_2, \dots, A_N$ be a finite measurable partition of $\mathcal{X}$, and let $n_k$ be the number of $\theta_i$ such that $\theta_i \in A_k$. By the definition of the DP and the conjugacy between the Dirichlet and the multinomial distributions, it can be shown that

$$(G(A_1), G(A_2), \dots, G(A_N)) \mid \theta_1, \dots, \theta_n \sim \mathrm{Dir}(\alpha H(A_1) + n_1, \dots, \alpha H(A_N) + n_N)$$

Since the above is true for all finite measurable partitions, the posterior distribution over G must be a DP as well. The posterior DP has concentration parameter $\alpha + n$ and base distribution $\frac{\alpha H + \sum_{i=1}^{n} \delta_{\theta_i}}{\alpha + n}$, where $\delta_{\theta_i}$ is a point mass located at $\theta_i$ and $n_k = \sum_{i=1}^{n} \delta_{\theta_i}(A_k)$. So, rewriting the posterior DP, we have

$$G \mid \theta_1, \theta_2, \dots, \theta_n \;\sim\; \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha}{\alpha + n} H + \frac{n}{\alpha + n} \cdot \frac{\sum_{i=1}^{n} \delta_{\theta_i}}{n}\right)$$

Note that the posterior base distribution is a weighted average between the prior base distribution H and the empirical distribution $\frac{1}{n}\sum_{i=1}^{n} \delta_{\theta_i}$. As the number of observations grows large, i.e., $n \gg \alpha$, the posterior is dominated by the empirical distribution, which is in turn a close approximation of the true underlying distribution (by the Glivenko-Cantelli lemma). This gives a consistency property of the DP: the posterior DP approaches the true underlying distribution. Apart from the abstract definition of the DP, it has two more representations: the Polya urn scheme or Chinese Restaurant Process (CRP), and the Stick Breaking (SB) construction.

Polya's Urn Scheme or CRP

Let µ be any finite positive measure on a complete separable metric space $\mathcal{X}$, and let $\mu^*$ be a random draw from a DP with parameter µ, which is almost surely discrete. A sequence $\{X_n, n \geq 1\}$ of random variables with values in $\mathcal{X}$ is a Polya sequence with parameter µ if for all $B \subseteq \mathcal{X}$

$$P(X_1 \in B) = \frac{\mu(B)}{\mu(\mathcal{X})}, \qquad P(X_{n+1} \in B \mid X_1, X_2, \dots, X_n) = \frac{\mu_n(B)}{\mu_n(\mathcal{X})}$$

where $\mu_n(\cdot) = \mu(\cdot) + \sum_{i=1}^{n} \delta_{X_i}(\cdot)$ and $\delta_x$ is the unit measure concentrated at x. For finite $\mathcal{X}$, the urn scheme can be described as follows: the sequence $\{X_n\}$ can be seen as the results of successive draws from an urn which initially has $\mu(x)$ balls of color x; after each draw the ball is replaced and another ball of the same color is added to the urn. $X_{n+1}$ represents the color of the ball drawn at the $(n+1)$-th draw and $P(X_{n+1})$ denotes the distribution of this event.

Proposition 2.2 (Blackwell and MacQueen's construction [27] of the DP via the Polya urn scheme): Let $X_n$ be a Polya sequence with parameter µ. Then $m_n = \frac{\mu_n}{\mu_n(\mathcal{X})}$ converges with probability 1 as $n \to \infty$ to a limiting discrete measure $\mu^*$; $\mu^*$ is a draw from a DP with parameter µ; and given $\mu^*$, the random variables $X_1, X_2, \dots$ are independent with distribution $\mu^*$.

In other words, we can realize a draw G from a $\mathrm{DP}(\alpha, H)$ as a random probability measure. Treating G as the De Finetti measure, let the $\theta_i$ be i.i.d. draws from G. By marginalizing out G, we can write down the distribution of $\theta_{n+1}$ conditioned on $\theta_1, \dots, \theta_n$, which is called the Polya urn scheme, as

$$\theta_{n+1} \mid \theta_1, \theta_2, \dots, \theta_n \;\sim\; \frac{1}{n + \alpha} \sum_{i=1}^{n} \delta_{\theta_i} + \frac{\alpha}{n + \alpha} H$$

Next comes the description of the DP as a CRP, which clearly shows the clustering property of a DP. Suppose we draw $\theta_1, \theta_2, \dots, \theta_n$ from a Polya urn scheme. Since the DP is discrete with probability 1, there will almost surely be $m < n$ distinct values among $\{\theta_i\}_{i=1}^{n}$, namely $\theta^*_1, \theta^*_2, \dots, \theta^*_m$. This gives a partition of $\{1, 2, \dots, n\}$ into m clusters such that if $\theta_i$ is in cluster j, then $\theta_i = \theta^*_j$. As the draws from the Polya urn scheme are random, this also induces a random partition of $\{1, 2, \dots, n\}$. To show the clustering property explicitly, the process can be explained metaphorically as follows. A Chinese restaurant has infinitely many tables, and each table has infinite capacity. Let the tables be denoted by $k = 1, 2, \dots$. Customers are indexed by i, with values $\phi_i$, and tables have values $\theta^*_k$ drawn from G. Let K, n, and $n_k$ denote the total number of occupied tables so far, the total number of customers so far, and the number of customers seated at table k, respectively.

Algorithm 1: CRP construction of the DP
  Customer 1 enters the restaurant and sits at table 1. Set $\phi_1 = \theta^*_1$ where $\theta^*_1 \sim G$; K = 1, n = 1, $n_1$ = 1.
  for n = 2, 3, ... do
    Customer n sits at table k (for k = 1, ..., K) with probability $\frac{n_k}{n - 1 + \alpha}$
    Customer n sits at table K + 1 (a new table) with probability $\frac{\alpha}{n - 1 + \alpha}$
    if customer n sits at a new table then
      K ← K + 1, $\theta^*_K \sim G$
    end if
    Set $\phi_n$ to the $\theta^*_k$ of the table k that customer n sat at; set $n_k \leftarrow n_k + 1$.
  end for

The resulting conditional distribution over $\phi_n$ is

$$\phi_n \mid \phi_1, \phi_2, \dots, \phi_{n-1}, G, \alpha \;\sim\; \frac{\alpha}{n - 1 + \alpha}\, G(\cdot) + \sum_{k=1}^{K} \frac{n_k}{n - 1 + \alpha}\, \delta_{\theta^*_k}(\cdot)$$
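A minimal simulation of the CRP seating scheme just described (an illustrative sketch, not part of the original text; the table values $\theta^*_k$ are omitted and only the induced random partition is returned):

```python
import numpy as np

def crp_partition(n, alpha, rng=None):
    """Simulate table assignments for n customers under CRP(alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = []                      # n_k: customers at each occupied table
    assignments = []
    for i in range(n):               # customer i + 1 arrives
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()         # existing tables prop. to n_k, new table prop. to alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)         # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

parts, sizes = crp_partition(100, alpha=2.0, rng=np.random.default_rng(1))
print(len(sizes), "tables; sizes:", sizes)
```

The number of occupied tables produced this way concentrates around the logarithmic growth rate discussed next.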

Figure 2-3. CRP representation of DP

It can be shown that, for large enough n, the expected number of clusters is

$$\mathrm{E}[m] = \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\!\left(1 + \frac{n}{\alpha}\right)$$

Note that the number of clusters grows only logarithmically in the number of observations. This growth is slow, and it also indicates the rich-gets-richer phenomenon. α directly controls the number of clusters, with larger α implying a larger number of clusters a priori.

Stick Breaking Representation of DP

In 1994, Sethuraman [28] gave a constructive definition of the DP. It is important in the sense that it gives an explicit description of what an actual draw from a DP looks like, and this construction helps in simulating the process. Let G be a draw from a DP with base measure H and concentration parameter α, i.e., $G \sim \mathrm{DP}(\alpha, H)$. Since G is discrete with probability 1, it can be written as

$$G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta^*_k}$$

where $\sum_{k=1}^{\infty} \pi_k = 1$ and

$$\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \beta_k \sim \mathrm{Beta}(1, \alpha), \qquad \theta^*_k \sim H$$

The SB construction can be understood as follows. Starting with a stick of unit length, break it at $\beta_1$, assigning $\pi_1$ to be the length of the piece we just broke off. Now recursively break the other portion to obtain $\pi_2, \pi_3$, and so forth. The SB distribution over π is sometimes written $\pi \sim \mathrm{GEM}(\alpha)$, where the letters stand for Griffiths, Engen and McCloskey.
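The stick-breaking construction can be simulated directly by truncating at a finite number of sticks. The sketch below is illustrative only; the truncation level K and the choice of a standard normal base measure H are assumptions made for the example:

```python
import numpy as np

def stick_breaking_draw(alpha, K=200, rng=None):
    """Truncated stick-breaking approximation of G ~ DP(alpha, H) with H = N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=K)             # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pis = betas * remaining                          # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    thetas = rng.normal(0.0, 1.0, size=K)            # theta*_k ~ H
    return pis, thetas

pis, thetas = stick_breaking_draw(alpha=5.0, K=200, rng=np.random.default_rng(0))
print("unused stick mass:", 1.0 - pis.sum())         # small when K is large
```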

Figure 2-4. SB construction for DP

The Pitman-Yor process is another example of an SB process, in which $\beta_k \sim \mathrm{Beta}(1 - a, b + ka)$. This is a generalization of the DP, because the DP corresponds to the case a = 0 and b = α. There is also another process, the two-parameter Beta process, in which $\beta_k \sim \mathrm{Beta}(a, b)$.

Dirichlet Process Mixture Model (DPMM)

A draw from a DP is an atomic distribution with probability 1. As it is not possible to generate an absolutely continuous distribution with a DP prior, researchers studied mixture distributions with a DP prior. In [29], DPMMs were first applied to a problem

in bio-assay, and in [27] the predictive distribution of the DP was shown to be related to the Blackwell-MacQueen urn scheme. Later, in [28], the constructive definition of the DP via a stick breaking process was given, which explicitly provides the formula to sample from a distribution that has a DP prior. Due to the atomic nature of the DP, clustering with this prior has been widely used. If $G \sim \mathrm{DP}(\alpha, H)$ and $\theta_1, \theta_2, \dots, \theta_n$ is an i.i.d. sequence drawn from G, the posterior distribution of G given the data and the prior H takes the simple form

$$p(G \mid y_1, \dots, y_n) \sim \mathrm{DP}(\alpha H(A_1) + n_1, \dots, \alpha H(A_N) + n_N)$$

where $n_1, n_2, \dots, n_N$ represent the numbers of observations falling in each of the partitions $A_1, A_2, \dots, A_N$ respectively, n is the total number of observations, and $\delta_{Y_i}$ represents the point mass at the sample point $Y_i$. We can also write down the predictive distribution for $\theta_{n+1}$:

$$\theta_{n+1} \mid \theta_1, \theta_2, \dots, \theta_n \;\sim\; \frac{1}{\alpha + n}\left(\alpha H + \sum_{i=1}^{n} \delta_{\theta_i}\right)$$

This representation is the Chinese restaurant process, which we have already seen. This nonparametric modeling is suitable in many situations because it can potentially model an infinite number of clusters, and it is useful when we do not have the number of clusters beforehand, which includes our application as well. The generative model for a general DPMM is given below:

$$G \sim \mathrm{DP}(\alpha, H), \qquad \theta_i \mid G \sim G, \quad i = 1, 2, \dots, n, \qquad x_i \mid \theta_i \sim f(\theta_i), \quad i = 1, 2, \dots, n \qquad (2\text{-}1)$$

For example, if $f(\theta)$ is a Gaussian density with parameters θ, then the model is called a DPMM of Gaussians. The DPMM is the infinite generalization of a latent mixture model in which the mixing proportions follow a Dirichlet distribution and the distribution of the class indicators is multinomial.

Figure 2-5. Graphical model for DPMM

Let $z_i$ be a cluster assignment variable, which takes on value k with probability $\pi_k$. The above equations can then be equivalently expressed as

$$\pi \mid \alpha \sim \mathrm{GEM}(\alpha), \qquad z_i \mid \pi \sim \mathrm{Mult}(\pi), \qquad \theta^*_k \mid H \sim H, \qquad x_i \mid z_i, \{\theta^*_k\} \sim f(\theta^*_{z_i})$$

with $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta^*_k}$ and $\theta_i = \theta^*_{z_i}$. In mixture modeling terminology, π is the mixing proportion, the $\theta^*_k$ are the cluster parameters, $F(\theta^*_k)$ is the distribution over data in cluster k, and H is the prior over cluster parameters. In the DP mixture model, the actual number of clusters used to model the data is not fixed and can be automatically inferred from the data using the usual Bayesian posterior inference framework, which can be done in two different ways:

- Markov chain Monte Carlo (MCMC) inference.
- Variational Bayes (VB) inference.
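An illustrative generative sketch of the DPMM formulation above. The specific choices here, a univariate Gaussian likelihood with unit variance, a N(0, 10) base distribution H, and a finite truncation level K, are assumptions made only for this example:

```python
import numpy as np

def sample_dpmm_data(n, alpha, K=100, rng=None):
    """Generate n points from a truncated DP mixture of unit-variance Gaussians."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=K)
    pis = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    pis /= pis.sum()                               # renormalize the truncated weights
    mus = rng.normal(0.0, np.sqrt(10.0), size=K)   # cluster parameters theta*_k ~ H
    z = rng.choice(K, size=n, p=pis)               # z_i ~ Mult(pi)
    x = rng.normal(mus[z], 1.0)                    # x_i ~ f(theta*_{z_i})
    return x, z, mus

x, z, mus = sample_dpmm_data(500, alpha=1.0, rng=np.random.default_rng(3))
print("number of occupied clusters:", len(np.unique(z)))
```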

Later, in the sampling section and in chapter 4, we will discuss each of these methods in detail.

Beta Process (BP)

In 1990, Hjort [18] introduced the Beta process in the context of survival models. In that setup he worked with right-censored data. Let $X_1, \dots, X_n$ be i.i.d. with unknown cumulative distribution function (cdf) F, where the data may be subject to right censoring. The problem was to construct a nonparametric Bayes estimator for F. This involves placing a prior probability distribution on the space $\mathcal{F}$ of all cdfs, which is equivalent to treating F as a stochastic process. The hazard rate $\alpha(t)$ is defined as the ratio $\frac{F'(t)}{F([t, \infty))}$, and the cumulative hazard function (CHF) can be defined as $A(t) = \int_0^t \alpha(s)\, ds$. From these two definitions one can see that

$$A(t) = \int_0^t \frac{dF(s)}{1 - F(s-)}, \qquad F(t) = 1 - \prod_{s \in [0, t]} (1 - dA(s))$$

Thus F(t) is defined via a product integral. Hjort wanted to carry out nonparametric Bayesian estimation of A in the survival model with right-censored data. He derived the Beta process (BP), and it turns out that this class gives a suitable prior distribution for A with large support and tractability. Beta processes produce cumulative hazard rates with independent increments that are approximately beta distributed. The Beta process prior forms a much richer and more flexible class than the Dirichlet process prior. Let $X_1, \dots, X_n \sim F$ be the survival times, and denote the censoring times by $c_1, \dots, c_n \sim C$. Let $T_i = \min(X_i, c_i)$ and $\delta_i = I(X_i \leq c_i)$ for $i = 1, \dots, n$. Hjort's problem statement was: given $\{T_i, \delta_i\}_{i=1}^{n}$, how do we estimate F? That is, under a nonparametric Bayesian setup, how do we find a suitable prior for F on $\mathcal{F}$, the set of all probability distributions on $\mathbb{R}^+$? In the previous subsection we saw the Dirichlet process (DP), which also aims to give a prior distribution on $\mathcal{F}$. The problem with the DP is that it produces a conjugate model only when the data are not censored; in particular, with right-censored data the DP is not conjugate. For this purpose the BP has been shown to produce a conjugate model.
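To make the relationship between A(t) and F(t) concrete, the following sketch computes an empirical (Nelson-Aalen-type) cumulative hazard from right-censored data and recovers a survival curve through the discrete product integral. This is a frequentist illustration of the identities above, not Hjort's Bayesian construction, and the exponential censoring setup in the usage example is an arbitrary assumption:

```python
import numpy as np

def cumulative_hazard_and_survival(T, delta):
    """Empirical cumulative hazard and product-integral survival estimate.

    T     : observed times, T_i = min(X_i, c_i)
    delta : event indicators, delta_i = I(X_i <= c_i)
    """
    T, delta = np.asarray(T, float), np.asarray(delta, int)
    times = np.unique(T[delta == 1])                 # distinct event times
    dA, S = [], []
    surv = 1.0
    for t in times:
        at_risk = np.sum(T >= t)                     # Y(t)
        events = np.sum((T == t) & (delta == 1))     # dN(t)
        jump = events / at_risk                      # dA(t) = dN(t) / Y(t)
        surv *= (1.0 - jump)                         # F(t) = 1 - prod_{s<=t} (1 - dA(s))
        dA.append(jump)
        S.append(surv)
    return times, np.cumsum(dA), np.array(S)

rng = np.random.default_rng(0)
X = rng.exponential(1.0, 200)                        # true survival times
C = rng.exponential(2.0, 200)                        # censoring times
t, A, S = cumulative_hazard_and_survival(np.minimum(X, C), (X <= C).astype(int))
print("A(t_max) =", A[-1], " F(t_max) =", 1.0 - S[-1])
```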

A CHF can be defined as

$$A(t) = \int_0^t P(X = s \mid X \geq s)\, ds$$

A is a non-decreasing function with A(0) = 0 and $\lim_{t \to \infty} A(t) = \infty$; let us denote $\Delta A(t) = A(t) - A(t-)$. The CHF property gives us $\Delta A(t) \in [0, 1]$. We also know the one-to-one relationship between F(t) and A(t) given by the product integral. Hjort defined his Beta process in the continuous case by taking limits of a time-discrete model. We will see a similar type of discrete process when we discuss another stochastic process, namely the Indian Buffet Process (IBP). Unlike the Dirichlet process, which is defined by the distribution of the joint probabilities over any finite measurable partition of $[0, \infty)$, the Beta process is best defined through the notion of a Lévy process, which has the following properties:

- It is almost surely non-decreasing.
- It has non-negative independent increments.
- It is almost surely right continuous.
- The limit goes to infinity as $t \to \infty$.
- It is zero at the origin.

Processes of this type are termed subordinators; their paths have finite variation. For such a process there can exist at most countably many fixed points of discontinuity at times $t_1, t_2, \dots$ with jumps $s_1, s_2, \dots$, which are independent non-negative random variables. Then

$$A_c(t) = A(t) - \sum_{t_k \leq t} s_k$$

is a non-decreasing process with independent increments and with no fixed points of discontinuity, and therefore can be represented by the Lévy-Khintchine formula. Using this, $A_c(t)$ can be represented through its characteristic function:

$$\mathrm{E}\!\left[e^{i\phi A_c(t)}\right] = \exp\!\left\{ i\phi\, a(t) + \int_0^{\infty} \left(e^{i\phi x} - 1\right) dL_t(x) \right\} = \exp\!\left\{ i\phi\, a(t) + \int_0^{t}\!\!\int_0^{\infty} \left(e^{i\phi x} - 1\right) \nu(ds\, dx) \right\}$$

where a is non-decreasing and continuous, with a(0) = 0. $L_t(x)$ is called the Lévy measure, and it satisfies the following properties:

- For every Borel set B, $L_t(B)$ is continuous and non-decreasing in t.
- For every real t > 0, $L_t(\cdot)$ is a measure on the Borel sets of $(0, \infty)$.
- $A_c(t)$ is finite a.s. whenever $\int_0^{\infty} \frac{x}{1 + x}\, dL_t(x)$ is finite.
- $\int_0^{\infty} \frac{x}{1 + x}\, dL_t(x) \to 0$ as $t \to 0$.

Since a represents a nonrandom component, it can be taken to be identically zero, and note that the Brownian motion part of the Lévy process has to be zero for a subordinator. It can then be shown that any subordinator is a limit of compound Poisson processes. Let N(t) be a Poisson process with mean $\Lambda = \int_{s=0}^{t} \lambda(s)\, ds$, and let Y(t) be independent random variables with distribution function G. A compound Poisson process is given by

$$A_c(t) = \sum_{s \leq t} Y(s)\, I(\Delta N(s) = 1)$$

Any subordinator can be approximated by letting Λ become large and Y(t) become small. According to Kim's paper [19], if the following conditions hold,

$$\nu([0, t] \times D) = \int_0^t \int_D dF_s(x)\, ds = \int_0^t \left(\int_D f_s(x)\, dx\right) ds$$

where for all t, $\int_0^t \int_0^1 x f_s(x)\, dx\, ds < \infty$, then there exists a unique subordinator whose Lévy measure is given by ν. We can always add a set $U = \{u_1, \dots, u_l\}$ of finitely many fixed discontinuity points, so that the modified Lévy measure can be represented by

$$\nu([0, t] \times D) = \int_D dL_t(x) + \sum_{t_j \leq t} \int_D dH_j(x)$$

where $H_j(x)$ is the distribution function of the jump variable $U_j$ corresponding to $u_j$. So we can write $A(t) = A_c(t) + A_d(t)$, where the fixed jumps are solely responsible for the part $A_d(t)$. With this setup, if $U = \emptyset$, A becomes a compound Poisson process such that $J(t) = \sum_{s \leq t} I(\Delta A(s) \neq 0)$ is a Poisson process with intensity function $\lambda(t) = \int_0^1 f_t(x)\, dx$, and the conditional distribution of $\Delta A(t)$, given $\Delta A(t) > 0$, is $\frac{f_t(x)}{\lambda(t)}$. This is only true when $\lambda(t) < \infty$; when it is infinite, we have to approximate A(t) by taking limits of a sequence of compound Poisson processes.
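A compound Poisson subordinator of the form above is straightforward to simulate. The sketch below uses a constant intensity λ and Beta-distributed jump sizes chosen purely for illustration; neither choice comes from the text:

```python
import numpy as np

def compound_poisson_path(tau, lam, rng=None):
    """Simulate A_c on [0, tau]: Poisson(lam * tau) jump count, Beta(1, 4) jump sizes."""
    rng = np.random.default_rng() if rng is None else rng
    n_jumps = rng.poisson(lam * tau)                  # N(tau) ~ Poisson(Lambda), Lambda = lam * tau
    times = np.sort(rng.uniform(0.0, tau, n_jumps))   # jump locations
    sizes = rng.beta(1.0, 4.0, n_jumps)               # Y(s) ~ G
    return times, np.cumsum(sizes)                    # A_c(t) = sum_{s <= t} Y(s)

times, A = compound_poisson_path(tau=10.0, lam=3.0, rng=np.random.default_rng(7))
print(len(times), "jumps, A_c(10) =", A[-1] if len(A) else 0.0)
```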

When $U \neq \emptyset$, A becomes an extended compound Poisson process with fixed discontinuities at $\{u_1, \dots, u_l\}$. The important features of the Beta process are the following:

- Beta process priors put measures on the space of CHFs rather than on the space of distribution functions.
- We can take the prior class to be a subordinator. The BP is a special form of subordinator which is conjugate to right-censored data.
- As $\Delta A(t) \in [0, 1]$, a possible prior is the beta distribution. For the BP, $\Delta A(t) \sim \mathrm{Beta}(0, c(t))$, or an incomplete beta distribution, where c(t) is a non-negative function often taken to be a positive constant.

So the BP is a subordinator with Lévy measure

$$\nu(ds\, dx) = c(s)\, x^{-1} (1 - x)^{c(s) - 1}\, dx\, dA_0(s)$$

where $\mathrm{E}[A(s)] = A_0(s)$. If $A_0$ has both continuous and discrete parts, then the BP can be decomposed into $A(t) = A_{contn}(t) + A_{discr}(t)$, where the two parts are independent. $A_{contn}(t)$ is a BP with parameters $A_0^{contn}(t)$ and c(t), and $A_{discr}(t) = \sum_{s \leq t} H_s$, where the $H_s$ are independent and distributed as

$$H_s \sim \mathrm{Beta}\!\left(c(s)\, \Delta A_0^{discr}(s),\; c(s)\!\left(1 - \Delta A_0^{discr}(s)\right)\right)$$

The variance of the BP is given by

$$\mathrm{Var}(A(t)) = \int_0^t \frac{dA_0(s)\,(1 - dA_0(s))}{1 + c(s)}$$

Hjort also gave the posterior distribution when the data are right-censored. Let $N(t) = \sum_{i=1}^{n} I(X_i \leq t, \delta_i = 1)$ and $Y(t) = \sum_{i=1}^{n} I(X_i \geq t)$. Now, if A is a BP prior with parameters c(t) and $A_0(t)$, then the posterior of A, denoted by $A^{post}$, has the updated

parameters

$$c^{post}(t) = c(t) + Y(t), \qquad A_0^{post}(t) = \int_0^t \frac{c(s)}{c(s) + Y(s)}\, dA_0(s) + \int_0^t \frac{dN(s)}{c(s) + Y(s)}$$

More details on the BP can be found in [22].

Completely Random Measure (CRM)

By Kingman's result [30], [17], it is known that the BP is one particular instance of a general family of random measures known as completely random measures (CRMs). A CRM φ on a probability space $(\Omega, \mathcal{F})$ is a random measure such that for any disjoint measurable sets $B_1, \dots, B_n \in \mathcal{F}$, the random variables $\phi(B_1), \dots, \phi(B_n)$ are independent. A CRM can be decomposed into three parts [30]: $\phi = \phi_f + \phi_d + \phi_o$, where $\phi_f$ is the fixed atomic component, $\phi_d$ is the deterministic component, and $\phi_o$ is the ordinary component. Note that $\phi_o$ is purely atomic, and the atoms of $\phi_o$ follow a superposition of independent Poisson processes, hence a Poisson process; $\phi_o$ is not necessarily sigma-finite. A CRM can be obtained from an underlying Poisson point process. Let us take a sigma-finite measure $\nu(dp\, d\omega)$ on the product space $[0, 1] \times \Omega$ and draw a collection of points $\{p_i, \omega_i\}$ from the Poisson point process with mean measure ν. We can then construct a random measure as

$$\mu = \sum_i p_i\, \delta_{\omega_i}$$

where $\delta_{\omega_i}$ denotes a point mass at $\omega_i$. For any measurable set T, the discrete random measure is given by

$$\mu(T) = \sum_{\omega_j \in T} p_j$$

Another Viewpoint of CRM

A CRM can be thought of as a functional of a Poisson random measure [31], which can be explained with the help of the figure below. Here $\mathcal{B}(\cdot)$ denotes the respective sigma field.

Figure 2-6. CRM alternative view

Note that we could have defined a more general linear functional, such as $\int_{S \times X} h(p)\, N(dp, dx)$, where S is a complete separable metric space and $h : S \to \mathbb{R}^+$. These are known as h-biased random measures. Here we are using a very simple h, namely $h(p) = p$, so the resulting measure is sometimes called a size-biased random measure.

BP

If B is a Beta process, $B \sim \mathrm{BP}(c, A_0)$, then B is a CRM with only the ordinary component $\phi_o$, and the corresponding Poisson process rate measure is

$$\nu(dp\, d\omega) = c(\omega)\, p^{-1} (1 - p)^{c(\omega) - 1}\, dp\, A_0(d\omega)$$

We assume here that $A_0$ is absolutely continuous and that $c(\cdot)$ is a non-negative function. For simplicity, $c(\cdot)$ is taken to be a constant, called the concentration parameter of the BP; this parameter acts as a precision parameter. $A_0$ is a non-negative base measure. The total mass $A_0(\Omega) = \gamma$ is called the mass parameter of the BP and is assumed to be finite and positive. Note that the $[0, t]$ space in Hjort's work has been generalized by Thibaux and Jordan [2] to an abstract space Ω. Like the DP, µ is almost surely discrete. Each pair $(p_i, \omega_i)$ corresponds to a location $\omega_i \in \Omega$ and its weight $p_i \in [0, 1]$. Now, $\nu([0, 1] \times \Omega)$ is typically infinite, which says that the underlying Poisson process generates infinitely many points, but $\mu(T)$ is finite if $A_0$ is finite. If $A_0$ is a discrete measure of the form $A_0 = \sum_i q_i \delta_{\omega_i}$, with $q_i \in [0, 1]$, then µ has atoms at the same locations almost surely and can be written as $\mu = \sum_i p_i \delta_{\omega_i}$, where

$$p_i \sim \mathrm{Beta}\!\left(c(\omega_i)\, q_i,\; c(\omega_i)(1 - q_i)\right)$$

This formulation is used in hierarchical models.

Bernoulli Process (BeP) and Indian Buffet Process (IBP)

In [2] the authors define a Bernoulli process with hazard measure µ, written $X \sim \mathrm{BeP}(\mu)$. If µ is continuous, then X is nothing but a Poisson process with mean measure µ:

$$X = \sum_{i=1}^{N} \delta_{\omega_i}, \qquad N \sim \mathrm{Poisson}(\mu(\Omega))$$

where the $\omega_i$ are independent draws from the probability distribution proportional to µ. If µ is discrete (e.g., a draw from any CRM) with $\mu = \sum_i p_i \delta_{\omega_i}$, then $X = \sum_i g_i \delta_{\omega_i}$, where the $g_i$ are independently Bernoulli distributed with parameters $p_i$. The authors also discussed the conjugacy of the BP with the BeP. Let $\mu \sim \mathrm{BP}(c, A_0)$ and let $\{X_i\}_{i=1}^{n}$ be n independent draws from $\mathrm{BeP}(\mu)$; note that each $X_i$ is now a Bernoulli process. Then the posterior distribution of µ given $\{X_i\}_{i=1}^{n}$ is a BP with parameters

$$\mu \mid \{X_i\}_{i=1}^{n} \;\sim\; \mathrm{BP}\!\left(c + n,\; \frac{c}{c + n} A_0 + \frac{1}{c + n} \sum_{i=1}^{n} X_i\right)$$

Connection to IBP

The IBP was introduced in [32], [11]. It is a stochastic process which can be viewed as a factorial analog of the CRP, and it generates an exchangeable prior distribution on binary matrices with infinitely many columns. Metaphorically, the process can be described as follows. There is a sequence of customers tasting dishes in an infinite buffet. Let Z be a binary matrix with infinitely many columns, and let $Z_i$ denote the row of the i-th customer, with $Z_{ik} = 1$ if the i-th customer tastes the k-th dish. The i-th customer tastes the k-th dish with probability $\frac{m_k}{i}$, where $m_k$ is the number of previous customers who have already sampled the k-th dish. After that, the i-th customer tastes an additional number of new dishes drawn from $\mathrm{Poisson}(\frac{\alpha}{i})$. In these cases, the abstract space Ω is the space of features; in particular, for the IBP it is the set of dishes. By marginalizing the underlying De Finetti measure, which is the BP in the case of the IBP, it was shown in [2] that

$$X_{n+1} \mid X_1, X_2, \dots, X_n \;\sim\; \mathrm{BeP}\!\left(\frac{c}{c + n} A_0 + \sum_j \frac{m_{n,j}}{c + n} \delta_{\omega_j}\right)$$

where $m_{n,j}$ is the number of customers who have tried dish $\omega_j$. Also, the number of new dishes for $X_{n+1}$ is distributed as $\mathrm{Poisson}(\frac{c\gamma}{c + n})$. This is a two-parameter generalization of the original IBP, which has only one parameter α; the original IBP corresponds to c = 1 and γ = α. It can easily be shown that the total number of unique dishes is distributed as

$$\mathrm{Poisson}\!\left(\gamma \sum_{i=1}^{n} \frac{c}{c + i - 1}\right) \approx \mathrm{Poisson}\!\left(\gamma + \gamma c \log \frac{c + n}{c + 1}\right)$$

This quantity goes to $\mathrm{Poisson}(\gamma)$ and $\mathrm{Poisson}(n\gamma)$ as c tends to 0 and ∞, respectively.
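A direct simulation of the one-parameter IBP restaurant metaphor described above (an illustrative sketch; the function name and the matrix-building details are not from the text):

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Return a binary feature matrix Z (customers x dishes) drawn from IBP(alpha)."""
    rng = np.random.default_rng() if rng is None else rng
    rows = []                                    # row i: 0/1 indicators over dishes seen so far
    m = []                                       # m_k: number of customers who tried dish k
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < mk / i) for mk in m]   # old dish k with prob m_k / i
        n_new = rng.poisson(alpha / i)                   # new dishes ~ Poisson(alpha / i)
        row += [1] * n_new
        m = [mk + r for mk, r in zip(m, row)] + [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(m)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = sample_ibp(10, alpha=3.0, rng=np.random.default_rng(0))
print("matrix shape:", Z.shape, " dishes per customer:", Z.sum(axis=1))
```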

BP in a Nutshell

Figure 2-7 is a pictorial summary of the BP and its connections: the BP as a Lévy process and subordinator, its Lévy measure and Lévy-Khintchine decomposition, its construction as a completely random measure and as a limit of compound Poisson processes, its role as a prior on the space of CHFs conjugate to right-censored data, and its role as the De Finetti mixing distribution of the IBP and as a prior for latent feature and sparse linear models in machine learning.

Figure 2-7. BP in a nutshell

2.3 Markov Chain Monte Carlo Sampling

Markov chain Monte Carlo (MCMC) [33] is a strategy for generating samples $x^{(i)}$ by exploring the state space $\mathcal{X}$ using a Markov chain. The chain is constructed so that it converges to the target distribution f(x). The evolution of the chain depends only upon the current state and a fixed transition matrix (discrete state space) or transition kernel (continuous state space). Important conditions regarding the Markov chain are:

- Irreducibility: from any state of the Markov chain, there is a positive probability of visiting any other state.
- Aperiodicity: the Markov chain should not have any cycle.
- Positive recurrence: a state is said to be transient if, given that we start in that state, there is a non-zero probability that we will never return to it (the return time is also called the hitting time). A state is recurrent if it is not transient; if the hitting time of a state has finite expectation, the state is called positive recurrent. An irreducible Markov chain has a stationary distribution if and only if all of its states are positive recurrent.

If a Markov chain is irreducible, positive recurrent, and aperiodic, then for any initial probability distribution the Markov chain eventually reaches the stationary distribution. One way to design an MCMC sampler is such that the stationary distribution

happens to be the target distribution; one way to ensure this is to satisfy the reversibility, or detailed balance, condition.

Metropolis-Hastings (MH) and Gibbs Sampling

In the Metropolis-Hastings (MH) algorithm with stationary distribution f(x) and proposal distribution $q(x^{\#} \mid x)$, each step involves sampling a new candidate value $x^{\#}$ given the current value x according to $q(x^{\#} \mid x)$. The probability of accepting the new value $x^{\#}$ is

$$A(x^{\#}, x) = \min\!\left(1, \frac{f(x^{\#})\, q(x \mid x^{\#})}{f(x)\, q(x^{\#} \mid x)}\right)$$

and with probability $1 - A(x^{\#}, x)$ the chain remains at the old value x. The Gibbs sampler is a special type of MH sampler in which the acceptance probability for each proposal is always 1. It can be used when it is possible to sample from the full conditional distributions. The basic Gibbs sampler is as follows:

Algorithm 2: Gibbs sampling
  Initialize $x_1, \dots, x_n$ with $x_1^{(0)}, \dots, x_n^{(0)}$
  for i = 0 to N - 1 do
    for j = 1 to n do
      Sample $x_j^{(i+1)} \sim f(x_j \mid x_1^{(i+1)}, \dots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \dots, x_n^{(i)})$
    end for
  end for
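A minimal random-walk Metropolis-Hastings sampler implementing the acceptance rule above. The standard normal target and the symmetric Gaussian proposal (for which the ratio $q(x \mid x^{\#}) / q(x^{\#} \mid x)$ cancels) are choices made only for this example:

```python
import numpy as np

def metropolis_hastings(log_f, x0, n_steps, step=1.0, rng=None):
    """Random-walk MH targeting the (unnormalized) density exp(log_f)."""
    rng = np.random.default_rng() if rng is None else rng
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + step * rng.normal()              # symmetric Gaussian proposal
        log_accept = log_f(x_new) - log_f(x)         # A = min(1, f(x#)/f(x)) for symmetric q
        if np.log(rng.random()) < log_accept:
            x = x_new                                # accept
        samples.append(x)                            # otherwise stay at the old value
    return np.array(samples)

samples = metropolis_hastings(lambda x: -0.5 * x**2, 0.0, 5000,
                              rng=np.random.default_rng(0))
print("mean:", samples.mean(), "var:", samples.var())
```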

Rejection Sampling Method

The rejection sampling method is one of the important techniques for sampling from a distribution p(x) that is known only up to a proportionality constant. Let q(x) be a distribution from which it is easy to sample, such that $p(x) < M q(x)$ for all x, with M finite. The rejection sampling method is given by the following algorithm; it can easily be shown that the accepted $x^{(i)}$ have probability distribution p(x).

Algorithm 3: Rejection sampling
  Initialize i = 1
  while i ≤ N do
    Sample $x \sim q(x)$ and $u \sim \mathrm{Unif}(0, 1)$
    if $u < \frac{p(x)}{M q(x)}$ then accept x as $x^{(i)}$ and set i = i + 1; otherwise reject.
  end while

Figure 2-8. Rejection sampling method

This method has a severe drawback: if M is too large, then the acceptance probability

$$P(x \text{ accepted}) = P\!\left(u < \frac{p(x)}{M q(x)}\right) = \frac{1}{M}$$

is too small, which makes it impractical for high-dimensional data.

Adaptive Rejection Sampling (ARS)

ARS, as introduced in [34], is an efficient sampling technique compared with the standard non-adaptive rejection sampling scheme, and it also generates i.i.d. samples. The density needs to be specified only up to a constant of integration. The requirement of ARS is that the sampling density f(x) must be log-concave; this avoids the need to locate the supremum of the envelope function. Moreover, after each rejection the envelope function is updated by incorporating the most recently acquired information; the rejection envelope and the squeezing function are piecewise exponential functions. ARS works very well in the univariate case. In our model, we have proved log-concavity of a multi-dimensional function, but applying the ARS sampling technique in multiple dimensions is not straightforward, as the integration of the volume generated by the intersection of the hyperplanes is very difficult to calculate. Adaptive rejection Metropolis sampling within Gibbs sampling [35] is another important variation of the ARS sampler.
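The basic (non-adaptive) rejection sampler of Algorithm 3, written out for a concrete toy case. The unnormalized Beta(2, 5) target, the uniform proposal, and the bound M are assumptions chosen only for illustration:

```python
import numpy as np

def rejection_sample(p, q_sample, q_pdf, M, n, rng=None):
    """Draw n samples from the density proportional to p using the envelope M * q."""
    rng = np.random.default_rng() if rng is None else rng
    out = []
    while len(out) < n:
        x = q_sample(rng)
        u = rng.uniform()
        if u < p(x) / (M * q_pdf(x)):    # accept with probability p(x) / (M q(x))
            out.append(x)
    return np.array(out)

p = lambda x: x * (1.0 - x) ** 4          # unnormalized Beta(2, 5) density
M = 0.09                                   # bound on p(x)/q(x) on [0, 1]; max of p is ~0.082
xs = rejection_sample(p, lambda r: r.uniform(), lambda x: 1.0, M, 2000,
                      rng=np.random.default_rng(0))
print("sample mean (true value 2/7 ~ 0.286):", xs.mean())
```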

DPMM model. The Gibbs sampling method can be easily implemented for conjugate prior distributions. We choose to implement the particular method that can also be adapted easily to non-conjugate models. Practical methods for the DPMM were first developed by Escobar and West [36] and MacEachern and Muller [37]. The basic model looks like:
y_i | θ_i ~ F(θ_i)
θ_i | G ~ G
G ~ DP(G_0, α)
The data y_1, y_2, ..., y_n are exchangeable and drawn from a mixture of distributions denoted by F(θ_i). G is the mixing distribution over θ, and the prior for G is a Dirichlet process with concentration parameter α and base distribution G_0. The development of MCMC algorithms for the DPMM comes from its CRP-type description. Due to intractability, exact computation of posterior expectations cannot be done, although they can be estimated using Monte Carlo methods. We can sample from the posterior distribution of the θ_i's by simulating a Markov chain which has this posterior distribution as its stationary distribution. The expectation of the predictive distribution for a new observation can also be computed in a similar manner. The following two algorithms are adapted from Neal (2000) [38]. One can use the first one only when conjugate priors are used, because it involves an integration with respect to the base measure which cannot be done analytically in the non-conjugate case. With non-conjugate priors, Monte Carlo integration is sometimes used to approximate the integral, but the error might be high in many situations. Here is the basic idea of how auxiliary variables [38] are used in MCMC algorithms. We can sample from a distribution P_x for x by sampling from some joint distribution P_xy for (x, y) and then discarding y; note that the marginal distribution of x is P_x. Here x is the permanent state of the Markov chain and the auxiliary variable y is introduced temporarily in the update step. The following are the general steps for MCMC sampling using an auxiliary variable:

Algorithm 4 Gibbs sampling - case of conjugate prior
  Let the state of the Markov chain consist of c_1, c_2, ..., c_n and θ = (θ_c : c ∈ {c_1, c_2, ..., c_n}). Repeatedly sample as follows:
  for i = 1 to n do
    if the present value of c_i is associated with no other observation then remove θ_{c_i} from the state
    end if
    Draw a new value for c_i from c_i | c_{-i}, y_i using the probabilities
      P(c_i = c | c_{-i}, y_i, θ) = b [n_{-i,c} / (n - 1 + α)] F(y_i, θ_c)   if c = c_j for some j ≠ i
      P(c_i ≠ c_j for all j ≠ i | c_{-i}, y_i, θ) = b [α / (n - 1 + α)] ∫ F(y_i, θ) dG_0(θ)
    if the new c_i is not associated with any other observation then draw a new value for θ_{c_i} from H_i and add it to the state, where H_i is the posterior of θ given the prior G_0 and the single observation y_i
    end if
  end for
  for all c ∈ {c_1, c_2, ..., c_n} do
    Draw a new value from θ_c | y_c, where y_c = {y_i : c_i = c}
  end for
The general auxiliary-variable steps referred to above are:
From the joint distribution P_xy, find the conditional distribution of y given x and draw a value for y.
Do some update of (x, y) that leaves P_xy invariant.
Finally take the updated value of x, discarding y.
Clearly the update for x will leave P_x invariant and the chain will converge to P_x. For the DPMM with non-conjugate priors, the following algorithm (Algorithm 5) was proposed by Neal [38].
Slice Sampling
One of the popular techniques for sampling DPM models was first described by Walker in [39]. Earlier sampling algorithms rely on integrating the random distribution function out of the model; these are called marginal methods. The slice sampling technique is instead based on an auxiliary-variable idea that allows one to sample a sufficient but finite number of variables in each iteration of a valid Markov chain with the correct stationary distribution. These types of algorithms are called conditional methods, and they are typically very simple to implement.

Algorithm 5 Gibbs sampling with m auxiliary parameters
  Let the state of the Markov chain consist of c_1, c_2, ..., c_n and θ = (θ_c : c ∈ {c_1, c_2, ..., c_n}). Repeatedly sample as follows:
  for i = 1 to n do
    Let k be the number of distinct c_j for j ≠ i, and let h = k + m. Label these c_j with values in {1, 2, ..., k}.
    if c_i = c_j for some j ≠ i then draw values independently from G_0 for those θ_c with k < c ≤ h
    end if
    if c_i ≠ c_j for all j ≠ i then let c_i have label k + 1 and draw values independently from G_0 for those θ_c with k + 1 < c ≤ h
    end if
    Draw a new value for c_i from {1, 2, ..., h} using the following probabilities:
      P(c_i = c | c_{-i}, y_i, θ_1, ..., θ_h) = b [n_{-i,c} / (n - 1 + α)] F(y_i, θ_c)   for 1 ≤ c ≤ k
      P(c_i = c | c_{-i}, y_i, θ_1, ..., θ_h) = b [(α / m) / (n - 1 + α)] F(y_i, θ_c)   for k < c ≤ h
    where n_{-i,c} is the number of c_j for j ≠ i that are equal to c, and b is the appropriate normalizing constant.
    Change the state to contain only those θ_c that are associated with one or more observations.
  end for
  for all c ∈ {c_1, c_2, ..., c_n} do
    Draw a new value from θ_c | y_c, where y_c = {y_i : c_i = c}, or perform some other update to θ_c that leaves this distribution invariant.
  end for
Efficient Slice Sampling
The usual slice sampling technique mixes slowly due to the correlation between u and w, where u denotes the auxiliary variables and w the mixture weights. By blocking appropriate variables this problem can be avoided. A general class of slice samplers can be defined with the help of an auxiliary positive sequence, which may be deterministic. Due to the nature of these algorithms, they can be applied to more general priors. The details have been nicely described in [40].
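To make the auxiliary-variable idea concrete, the following is a small illustrative sketch (in Python, with hypothetical function names) of a single univariate slice-sampling update for a target known only up to a constant; the conditional samplers for the DPM in [39] and [40] apply the same idea jointly to the mixture weights and the allocation variables.

```python
import numpy as np

def slice_sample_step(x0, log_p, w=1.0, rng=np.random.default_rng()):
    """One slice-sampling update for a univariate target known up to a constant.

    Introduces the auxiliary variable u ~ Unif(0, p(x0)) (handled on the log scale),
    then samples x uniformly from the horizontal slice {x : p(x) > u}, located by
    stepping out and shrinking."""
    log_u = log_p(x0) + np.log(rng.uniform())      # defines the slice {x : log p(x) > log_u}
    # step out an interval of width w containing x0
    left = x0 - w * rng.uniform()
    right = left + w
    while log_p(left) > log_u:
        left -= w
    while log_p(right) > log_u:
        right += w
    # shrink the interval until a point inside the slice is found
    while True:
        x1 = rng.uniform(left, right)
        if log_p(x1) > log_u:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

# toy usage: an unnormalized two-component mixture target
log_p = lambda x: np.logaddexp(-0.5 * (x + 2) ** 2, -0.5 * (x - 2) ** 2)
rng = np.random.default_rng(1)
x, chain = 0.0, []
for _ in range(5000):
    x = slice_sample_step(x, log_p, w=2.0, rng=rng)
    chain.append(x)
```

Iterating this update leaves the joint distribution of (x, u) invariant, so the marginal chain in x has the unnormalized target as its stationary distribution.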

2.4 Variational Bayes (VB) Inference
Approximate Inference Procedure
The goal of nonparametric Bayesian inference is to compute the posterior distribution P(W | X, θ), where the observations are X = {x_1, x_2, ..., x_n}, the latent variables are W = {w_1, w_2, ..., w_k} and the hyper-parameters are θ. Using Bayes rule we have
P(W | X, θ) = exp{ log P(X, W | θ) - log P(X | θ) }
The posterior distribution is intractable when a DPM prior is used. Two types of approximate inference techniques are popular in recent times. One is to sample from the intractable posterior with Markov chain Monte Carlo (MCMC) methods. Another, faster, way is to use a variational distribution to construct a lower bound on the log marginal likelihood log P(X | θ) and to maximize that bound by optimizing over the variational parameter space. The basic idea of variational inference (VI) [41], [42] is very simple: the goal is to minimize the Kullback-Leibler (KL) divergence between the variational distribution and the posterior. Let Q_ν(W) denote the variational distribution, where ν is the set of parameters of the distribution Q. The KL divergence between these two distributions can be written as
D_KL( Q_ν(W) || P(W | X, θ) ) = E_Q[ log Q_ν(W) ] - E_Q[ log P(W, X | θ) ] + log P(X | θ)     (2-2)
where E_Q is the expectation taken with respect to the variational distribution Q. For a tractable optimization procedure, a fully factorized variational distribution is usually considered, which breaks all the dependencies among the latent variables. This type of variational inference is also called mean-field variational inference. VI aims to approximate the posterior distribution with a factorized distribution of known form. It is often faster than its MCMC counterpart. It is a deterministic search algorithm that tries to optimize a given objective function; in most cases, because of the non-convex nature of the objective function, it gets stuck in a local optimum, so the choice of initial condition is crucial. VI requires the following steps:

Approximate the posterior P(W | X, θ) using a family Λ of distributions parametrized by the parameter ν.
Determine a distribution P_ν ∈ Λ such that
P_ν = arg min_{Q(ν) ∈ Λ} D_KL( Q(ν), P(W | X, θ) )
where D_KL( Q(ν), P(W | X, θ) ) is the KL-divergence between the two distributions. The variational parameter ν is the parameter of optimization.
KL-Divergence
The KL-divergence, or Kullback-Leibler divergence, is a non-symmetric measure of the distance between two probability distributions P and Q:
D_KL( P(X), Q(X) ) = ∫ p(X) log[ p(X) / q(X) ] dX = E_P[ log p(X) ] - E_P[ log q(X) ]
where p(X) and q(X) are the density functions of P and Q, respectively. It has the following properties:
-E_P[ log p(X) ] is called the entropy of P and is denoted by H(P).
D_KL(P, Q) is not a metric, as it is not symmetric.
D_KL(P, Q) ≥ 0, and D_KL(P, Q) = 0 iff P = Q.
We have
D_KL( Q(ν), P(W | X, θ) ) = -H(Q(ν)) - E_{Q(ν)}[ log P(X, W | θ) ] + log P(X | θ)
As D_KL( Q(ν), P(W | X, θ) ) ≥ 0, we have
log P(X | θ) ≥ E_{Q(ν)}[ log P(X, W | θ) ] + H(Q(ν))     (2-3)
So the goal is to find a P_ν ∈ Λ that maximizes the right hand side of Equation 2-3, and the optimization problem becomes
max_{Q(ν) ∈ Λ} E_{Q(ν)}[ log P(X, W | θ) ] + H(Q(ν))

which equals
max_{Q(ν) ∈ Λ} E_{Q(ν)}[ log P(W | X, θ) ] + H(Q(ν)) + log P(X | θ)
For tractability, Q is taken to be a fully factorized distribution Q(ν) = ∏_{i=1}^{k} Q_{ν_i}(w_i), and each Q_{ν_i}(w_i) and each conditional distribution P(w_i | W_{-i}, θ) are assumed to belong to the exponential family. The objective function can then be written as the sum of terms S_i for i = 1, 2, ..., k, where S_i is given by
S_i = E_Q[ log P(w_i | W_{-i}, X, θ) ] - E_{Q_{ν_i}}[ log Q_{ν_i}(w_i) ]
Now using the forms of the exponential family, we have
Q_{ν_i}(w_i) = h(w_i) exp[ ν_i^T w_i - A(ν_i) ]
P(w_i | W_{-i}, X, θ) = h(w_i) exp[ g_i(W_{-i}, X, θ)^T w_i - A(g_i(W_{-i}, X, θ)) ]
where g_i(W_{-i}, X, θ) denotes the natural parameter for w_i when conditioning on the remaining latent variables and the observations X. Taking the derivative of S_i and equating it to zero, we get the update equations
ν_i = E_Q[ g_i(W_{-i}, X, θ) ],    i = 1, 2, ..., k     (2-4)
Q(ν) is assumed to be a fully factorized distribution; otherwise the simplification in the above equations cannot be done. Also, without the assumption that P(w_i | W_{-i}, θ) is from an exponential family, the update equation 2-4 may not have an analytical form. We will use this inference technique in the material in Chapter 4.
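As a concrete, hedged illustration of the coordinate updates in Equation 2-4, the sketch below runs mean-field coordinate ascent for the standard conjugate model x_i ~ N(µ, 1/τ), µ | τ ~ N(µ_0, 1/(λ_0 τ)), τ ~ Gamma(a_0, b_0), for which both factors q(µ) = N(µ_N, 1/λ_N) and q(τ) = Gamma(a_N, b_N) lie in the exponential family. This is a textbook example, not the model developed in this dissertation, and the hyper-parameter values and function name are illustrative assumptions.

```python
import numpy as np

def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Mean-field CAVI for x_i ~ N(mu, 1/tau), mu | tau ~ N(mu0, 1/(lam0*tau)),
    tau ~ Gamma(a0, b0); returns the parameters of q(mu) and q(tau)."""
    N, xbar = len(x), np.mean(x)
    E_tau = a0 / b0                      # initial guess for E_q[tau]
    a_N = a0 + (N + 1) / 2.0             # fixed across iterations
    for _ in range(iters):
        # update q(mu): precision-weighted combination of prior and data
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # update q(tau): uses E_q[(x_i - mu)^2] = (x_i - mu_N)^2 + 1/lam_N
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N
        b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N

# toy usage
rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.5, size=200)
print(cavi_normal(data))
```

Each pass updates one factor given expectations under the other, exactly as in Equation 2-4, so the lower bound of Equation 2-3 increases monotonically.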

CHAPTER 3
GEOMETRIC AND STATISTICAL PROPERTIES OF STIEFEL MANIFOLD
3.1 Geometric Properties of Stiefel Manifold
Analytic Manifold
A d-dimensional topological manifold [4], [5] M is a topological space that satisfies:
M is a Hausdorff space, that is, a topological space in which for any two distinct points x and y there exist neighborhoods U_x around x and U_y around y such that U_x and U_y are disjoint.
M is locally homeomorphic to Euclidean space: for any point x ∈ M there exist a neighborhood U ⊂ M around x and a mapping φ : U → R^d such that φ(U) is an open set in R^d. Together (U, φ) is called a coordinate chart.
M is second countable, that is, there exists a countable system of open sets, known as a basis, such that every open set in M is the countable union of some sets in the basis.
Given two coordinate charts (U, φ) and (V, ψ), if U ∩ V is non-empty, then the map φ ∘ ψ^{-1} is defined from the open set ψ(U ∩ V) ⊂ R^d to the open set φ(U ∩ V) ⊂ R^d. An analytic (smooth or C^∞) manifold is a manifold such that for all coordinate charts (U, φ) and (V, ψ), either U ∩ V is empty, or U ∩ V is nonempty and the map φ ∘ ψ^{-1} is analytic.
Stiefel Manifold
The Stiefel manifold [7], [6], [8], [43], [44] V_{n,p} is the space whose points are p-frames in R^n, where a set of p orthonormal vectors in R^n is called a p-frame in R^n (p ≤ n). The Stiefel manifold V_{n,p} is represented by the set of n × p matrices X such that X^T X = I_p, where I_p is the p × p identity matrix; so V_{n,p} = {X (n × p) : X^T X = I_p}. Thus the Stiefel manifold V_{n,p} consists of n × p tall-skinny orthonormal matrices. V_{n,p} defines a surface that is a subset of the sphere of radius √p in R^{n×p} with the Euclidean distance. This is a direct consequence of the fact that for X = (x_{i,j}) ∈ V_{n,p} (i = 1, 2, ..., n; j = 1, 2, ..., p) we have Σ_{i,j} x_{i,j}^2 = p.
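A quick numerical illustration of these defining properties (a hypothetical NumPy helper): mapping a Gaussian matrix Z to its polar part Z (Z^T Z)^{-1/2} produces a point of V_{n,p} whose columns are orthonormal and whose squared Frobenius norm equals p. The same construction reappears in Section 3.2 as a way of drawing from the uniform distribution on V_{n,p}.

```python
import numpy as np

def stiefel_point(n, p, rng):
    """Map a random n x p Gaussian matrix Z to its polar part Z (Z^T Z)^{-1/2},
    which lies on the Stiefel manifold V_{n,p}."""
    Z = rng.standard_normal((n, p))
    w, V = np.linalg.eigh(Z.T @ Z)          # Z^T Z = V diag(w) V^T
    return Z @ (V / np.sqrt(w)) @ V.T       # Z (Z^T Z)^{-1/2}

rng = np.random.default_rng(0)
n, p = 5, 3
X = stiefel_point(n, p, rng)
assert np.allclose(X.T @ X, np.eye(p))      # orthonormal columns: X^T X = I_p
assert np.isclose(np.sum(X ** 2), p)        # squared Frobenius norm equals p
```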

Following are some interesting special cases [43]:
a 1-frame is just a unit vector, so V_{n,1} = S^{n-1}, the unit sphere in R^n.
an orthonormal n-frame is identical to an orthogonal matrix, so V_{n,n} = O(n), the orthogonal group consisting of all orthogonal n × n matrices, with the group operation being matrix multiplication.
an orthonormal (n-1)-frame can be extended uniquely to an orthonormal n-frame whose matrix has determinant 1, so V_{n,n-1} = SO(n), the special orthogonal group (a normal subgroup of O(n)) consisting of all n × n rotation matrices.
The Stiefel manifold may be embedded in the np-dimensional Euclidean space of n × p matrices. Clearly V_{n,p} is a subset of the set R^{n×p}, which admits a linear manifold structure. The set R^{n×p} is a vector space with the standard sum and multiplication by a scalar, and hence has a natural linear manifold structure. A chart of this manifold is given by φ : R^{n×p} → R^{np}, X ∈ R^{n×p} ↦ vec(X) ∈ R^{np}, where vec(X) is the operation of stacking all the columns of the matrix X, from left to right, one below another. The dimension of the manifold R^{n×p} is np. The manifold R^{n×p} can be further turned into a Euclidean space with the standard inner product defined as
⟨X_1, X_2⟩ := vec(X_1)^T vec(X_2) = trace(X_1^T X_2)
This inner product induces a norm, which is the standard Frobenius norm defined by ||X||_F^2 = trace(X^T X).
Consider the function h : R^{n×p} → sym(p), X ↦ X^T X - I_p, where sym(p) is the set of all symmetric p × p matrices. Clearly sym(p) is a vector space. Also note that V_{n,p} = h^{-1}({0_p}).
Proposition 3.1. (Submersion theorem) [8] Let F : M_1 → M_2 be a smooth mapping between two manifolds of dimensions d_1 and d_2, d_1 > d_2, and let y be a point on M_2. If y is a regular value of F (i.e. the rank of F is equal to d_2 at every point of F^{-1}(y)), then F^{-1}(y) is a closed embedded submanifold of M_1, and dim(F^{-1}(y)) = d_1 - d_2.

In order to obtain the dimension of V_{n,p}, we first notice that the dimension of sym(p) is p(p+1)/2, as a symmetric matrix is completely determined by its upper triangular part, including the diagonal. From the above proposition we have
dim(V_{n,p}) = np - p(p+1)/2
There is another way to realize the dimension of V_{n,p}, by counting the number of functionally independent conditions on the np elements of X, column by column. For the first column there is 1 condition, because each column is a unit vector. For the second column there are 2 because, in addition to being a unit vector, it has to be orthogonal to the first column. Similarly for the third column there are 3, because it has to be orthogonal to the first two columns. So over all p columns the number of constraints is 1 + 2 + ... + p = p(p+1)/2, and the dimension of V_{n,p} is np minus the number of constraints, that is, np - p(p+1)/2.
Topologically, V_{n,p} can be viewed as built from the p spheres S^{n-1}, S^{n-2}, ..., S^{n-p}: choosing the columns one at a time, the first column lies on S^{n-1}, the second on a copy of S^{n-2} (the unit sphere of the orthogonal complement of the first column), and so on, although this iterated sphere bundle is not a global product in general. We view V_{n,p} as an embedded submanifold of R^{n×p}, with the subspace topology induced from R^{n×p}. All the columns of an element of the Stiefel manifold have unit norm, so the Frobenius norm satisfies ||X||_F^2 = p, and hence V_{n,p} is bounded. Also, V_{n,p} is the inverse image of the closed set {0_p} under the continuous function h, so it is closed. By the Heine-Borel theorem, V_{n,p} is compact.
Group Action
Definition 1. Let G be a group and X a set. Then G is said to act on X (on the left) if there is a mapping φ : G × X → X satisfying two conditions:
If e is the identity element of G, then φ(e, x) = x for all x ∈ X.
If g_1, g_2 ∈ G, then φ(g_1, φ(g_2, x)) = φ(g_1 g_2, x) for all x ∈ X.
We define a right action in a similar manner. When G is a topological group, X is a topological space, and φ is continuous, the action is called continuous.

52 One examples of group action is - invertible linear map acting on a real vector space: φ : GLn) R n R n, φa, x) = Ax, where GLn) is the group of invertible n n matrices. This example can be extended to other subgroups of GLn), such as orthogonal group On), special orthogonal group SOn) etc. Definition 2. Two points x 1, x 2 X are said to be equivalent under G, written x 1 x 2, if there exists a g G such that x 2 = φg, x). Definition 3. The function f whose domain is X, is said to be invariant under G, if f φg, x)) = f x) x X g G Definition 4. If x 1 x 2 x 1, x 2 X, then the group G is said to act transitively on X and X is said to be homogeneous with respect to G. Definition 5. For x 0 X, we define the subgroup G 0 of G as isotropy group of G at x 0 which consists of all transformations which leaves x 0 invariant. G 0 = {g G : φg, x 0 ) = x 0 } Definition 6. Let G 0 be the isotropy group of G at x 0 X. For each g G, the set gg 0 = {φg, g 0 ), g 0 G 0 } G is called a left coset of G 0 of G. we define the quotient G/G 0 := {gg 0, g G} as the set of cosets of G 0 in G. In other words, if x 0 is any point of a homogeneous space X with respect to a group G) and G 0 is the subgroup consisting of all elements of G which leave x 0 invariant, and if h G transform x 0 into x, then the set of all elements of G which transform x 0 into x is the left coset hg 0. Thus the points x X are in one-to-one correspondence with the left cosets hg 0. Hence a space, homogeneous with respect to a group of transformations, may be regarded as a space of left) cosets of the group. 52

53 We take the set X = V n,p and G = On). The action left) of On) on V n,p is given by: φon) V n,p V n,p, φq, A) = QA with the group operation being matrix multiplication. On) acts transitively on V n,p. The isotropy subgroup of On) at [I p : 0] T V n,p is G 0 = I p 0 On), B 1 On p) 0 B 1 and the coset corresponding to Q 1 V n,p is [Q 1 : Q 2 ]G 0 where Q 2 is any n n p) such that [Q 1 : Q 2 ] On). The coset consists of all orthogonal n n matrices with Q 1 as the first p columns. Writing the homogeneous space V n,p as the coset space of the isotropy group we have V n,p = On)/On p). Definition 7. An exterior differential form [44], [6] of degree r in R n is an expression of the type i 1 <i 2 < <i r h i1 i r x)dx i1 dx ir where h i1 i r x) are analytic functions of x 1,, x m. We can regard the above equation as the integrand of an r-dimensional surface integral. Note that A form of degree m has only one term, namely, hx)dx 1 dx m. A form of degree greater than m is zero because at least one of the symbols dx i is repeated in each term. We can equivalently define exterior differential form on analytic manifolds and these can be used to construct invariant measures on such manifolds. For any matrix X, dx denotes the matrix of of differentials dx ij ). By using rules of matrix calculus, we can see if X is n m and Y is m p, then dxy ) = X.dY + dx.y 53

54 For an arbitrary n m matrix X, the symbol dx ) will denote the exterior product of the mn elements of dx : dx ) m n dx ij. Similarly, if X symm), the symbol dx ) will denote the exterior product of the 1 mm + 1) 2 distinct elements of dx : dx ) 1 i j m dx ij, and, if X skew-symm), the symbol dx ) will denote the exterior product of the 1 mm 1) 2 distinct elements of dx : dx ) i<j dx ij. Proposition 3.2. Let Z be an n mn p) matrix of rank m and write Z = H 1 T, where H 1 is an n p with H1 T H 1 = I p and T is an p p upper-triangular matrix with positive diagonal elements. Let H 2 a function of H 1 ) be an n n p) matrix such that H = [H 1 : H 2 ] is an orthogonal n n matrix and writing H = [h 1 h p : h p+1 h n ], where h 1,, h p are the columns of H 1 and h p+1,, h n are columns of H 2. Then dz) = p t n i ii dt )H T 1 dh 1 ) where, H T 1 dh 1 ) = p n j=i+1 h T j dh i Considering the above differential form where H 1 V n,p, we have H T 1 dh 1 ) p n j=i+1 h T j dh i where [H 1 : H 2 ] = [h 1 h p : h p+1 h n ] On) is a function of H 1. Note that, this differential form does not depend on the choice of the matrix H 2 and invariant under the transforms H 1 QH 1 [Q On)] and H 1 H 1 P [P Op)]. 54

This defines an invariant measure [44] on the Stiefel manifold V_{n,p}. The surface area or volume of the Stiefel manifold V_{n,p} is
Vol(V_{n,p}) = ∫_{V_{n,p}} (H_1^T dH_1)
It can be shown that
∫_{V_{n,p}} (H_1^T dH_1) = 2^p π^{np/2} / Γ_p(n/2)
where Γ_p(·) is the multivariate Gamma function, a generalization of the Gamma function:
Γ_p(n/2) = π^{p(p-1)/4} ∏_{j=1}^{p} Γ((n - j + 1)/2)
Proposition 3.3. If X is a topological space and G is a transitive compact topological group of transformations of X onto itself such that HX is a continuous function of H and X into X, then there exists a finite measure µ on X invariant under G. µ is unique in the sense that any other invariant measure on X is a constant finite multiple of µ.
The measure defined on V_{n,p} by the differential form above is called the invariant unnormalized measure; it is often called the Haar measure. This measure can be normalized to a probability measure by setting
[dH] = (1 / Vol(V_{n,p})) (H_1^T dH_1)
We denote the normalized measure by µ*, so that
µ*(A) = ∫_A [dH],  A ∈ B(V_{n,p}),  with  ∫_{V_{n,p}} [dH] = 1
where B(V_{n,p}) is the Borel σ-algebra generated by the open sets of V_{n,p}. Now µ* has the following two properties:
µ*(·) is left-invariant under the action of O(n) on V_{n,p}, so µ*(QA) = µ*(A) for all Q ∈ O(n).
µ*(·) is right-invariant under the action of O(p) on V_{n,p}, so µ*(AQ) = µ*(A) for all Q ∈ O(p).
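As a sanity check on the volume formula above, Vol(V_{n,p}) = 2^p π^{np/2} / Γ_p(n/2) can be evaluated numerically; for p = 1 it should reduce to the surface area of the unit sphere S^{n-1} (2π for n = 2, 4π for n = 3). The short sketch below assumes SciPy is available for the log-gamma function.

```python
import numpy as np
from scipy.special import gammaln

def log_multivariate_gamma(a, p):
    """log Gamma_p(a) = (p(p-1)/4) log(pi) + sum_{j=1}^{p} log Gamma(a - (j-1)/2)."""
    return p * (p - 1) / 4.0 * np.log(np.pi) + sum(
        gammaln(a - (j - 1) / 2.0) for j in range(1, p + 1))

def stiefel_volume(n, p):
    """Vol(V_{n,p}) = 2^p * pi^{np/2} / Gamma_p(n/2), computed in log space."""
    log_vol = (p * np.log(2.0) + (n * p / 2.0) * np.log(np.pi)
               - log_multivariate_gamma(n / 2.0, p))
    return np.exp(log_vol)

print(stiefel_volume(2, 1))   # ~ 6.283  = 2*pi, circumference of S^1
print(stiefel_volume(3, 1))   # ~ 12.566 = 4*pi, surface area of S^2
print(stiefel_volume(3, 3))   # ~ 157.91 = 16*pi^2, the volume of O(3) under this normalization
```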

µ*(·) plays the same role on V_{n,p} that the Lebesgue measure plays on R^n, but as the manifold V_{n,p} is compact, this measure is finite. Thus it is the uniform distribution on V_{n,p}, and µ*(·) is the unique probability measure which is invariant under rotations and reflections.
Tangent and Normal Space of Stiefel Manifold
The tangent space [7] at a point p is the plane tangent to the submanifold at that point. For a d-dimensional manifold, the correct way to visualize the tangent space is as a d-dimensional vector space with the origin at the point of tangency. The normal space is the orthogonal complement of this vector space. Let Y denote a point on the Stiefel manifold, so that Y^T Y = I. Differentiating, we get Y^T Δ + Δ^T Y = 0, that is, Y^T Δ is skew-symmetric. This skew-symmetry imposes p(p+1)/2 constraints on Δ, so the vector space of all tangent vectors has dimension
dim(V_{n,p}) = np - p(p+1)/2 = p(p-1)/2 + p(n-p)     (3-1)
Figure 3-1. The tangent and normal spaces of an embedded manifold.
As we are viewing the Stiefel manifold as one embedded in Euclidean space [7], we can use the standard inner product in the np-dimensional Euclidean space, which is ⟨Δ_1, Δ_2⟩ = trace(Δ_1^T Δ_2)

57 and this is also Frobenious inner product for n p matrices. The normal space at a point Y consists of all the matrices with trace T N) = 0 Using the fact that any matrix can be broken as a sum of its projection onto the normal and tangent spaces, we see that the general form of tangent directions at Y, = YA + Y B = YA + I YY T )C where A is p p skew-symmetric, B is any n p) p, C is any n p matrix. Clearly, B = Y T C. Y is any n n p) such that YY T + Y Y T = I. The exact closed form of the geodesic equation for a curve Y t) on V n,p is given in Edelman s paper by: ) Y t) = Y 0), Y 0) expt A I S0) A I 2p,p e At where A = Y T Ẏ and S is a symmetric matrix given by Ẏ T Ẏ. On the other hand using the quotient geometry on Stiefel, we can define canonical metric and geodesic [7]. The canonical metric is given by g c, ) = trace T I 1 2 YY T ) ). Regarding the geodesic, Edelman [7] gave the functional form of it emanating from Y in the direction of H by Y t) = YMt) + QNt) where Y and H are n p matrices such that Y T Y = I p and A = Y T H is skew symmetric and QR := K = I YY T )H 57

58 is the compact QR-decomposition of K. Mt) and Nt) are given by the following matrix exponential Mt) Nt) = exp t A RT R 0 I p Statistical Properties of Stiefel Manifold Probability Distribution on Stiefel Manifold Let X be an n p random matrix on V n,p. The differential form dx discussed later) gives the unnormalized invariant measure on Stiefel V n,p. This in turn gives a normalized invariant measure [6], [43] or normalized Haar measure µ which is the uniform distribution denoted by [dx ] = dx )/VolV n,p ), where VolV n,p ) = 2p π np 2 Γ p n 2 ). Proposition 3.4. If X is distributed uniformly on V n,p, then H 1 XH T 2 is also uniformly distributed for any H 1 On) and H 2 Op) where H 1 and H 2 are independent of X, hence we have EX ) = 0 Let H 1 and H 2 be i.i.d distributed on On) and Op), respectively, and let X 0 be any n p matrix in V n,p, constant or independent of H 1 and H 2, then the random matrix H 1 X 0 H T 2 is uniformly distributed on V n,p. A random matrix uniformly distributed on V n,p can be expressed as X = ZZ T Z) 1 2 Our Bayesian framework uses one of the non-uniform distributions on V n,p which is known as the Matrix Langevin distribution. The one dimensional special case of this distribution is von-mises distribution on hypersphere. So based on the normalized Haar measure µ, an exponential family of probability distribution has been defined on V n,p in the following way. The density function can be written as: df X ) = 1 0F 1 n 2, 1 4 F T F ) exptracef T X ))[dx ] 3 2) 58

59 where F is an n p parameter matrix and 0 F 1 is a function of zonal polynomial or alternatively hypergeometric function with a matrix argument. We can write the normalizing constant as: 0F 1 n 2, 1 4 F T F ) = exptracef T H 1 ))[dh 1 ] V n,p where, dh 1 is normalized invariant measure on V n,p. Also it can be shown that: 0F 1 n 2, 1 4 F T F ) = On) exptracef T H 1 ))[dh] where H = [H 1 : H 2 ] On) and dh is normalized invariant measure on On). Actually the more general and flexible class of probability densities having linear as well as quadratic terms is given by - f BMF X A, B, F ) exptracef T X + BX T AX )) where A and B are generally taken to be symmetric matrices and matrix. For F = 0, we get Matrix Bingham distribution and for A or B equals to zero we get Matrix Langevin distribution. In this dissertation we have only talked about the Matrix Langevin distribution Properties of Matrix Langevin Distribution This density [6], [43] has a mode at X = M, where M is the polar part or orientation of F. F can be decomposed into product of two matrices i.e F = MK by the polar decomposition method. Clearly, M V n,p and K is a p p symmetric positive semi-definite matrix. K is the elliptical part or concentration of F. We saw that F = MK and K is a symmetric positive definite matrix, so K can be eigen decomposed by K = UD φ U T where U Op) and D φ is a diagonal matrix diagφ 1, φ 2, φ p ). Now we can write the hypergeometric normalizing constant as: 0F 1 n 2, 1 4 F T F ) = 0 F 1 n 2, 1 4 K T M T MK) = 0 F 1 n 2, 1 4 K T K) 0F 1 n 2, 1 4 K T K) = 0 F 1 n 2, 1 4 UD φu T UD φ U T ) = 0 F 1 n 2, 1 4 D2 φ) where M T M = I p as M V n,p and 0 F 1 ) depends only on the eigenvalues of the parameter matrix. Let X 1, X 2, X n are the observations generated from this density. As this is an exponential family the maximum likelihood estimator MLE) of the parameter F is 59

60 ˆF = ˆM ˆK and determined by: EˆF X ) = X But X may not be a point on V n,p, so it is not an suitable estimator. Let R be the elliptical part of X, then R = X T X 1 2 so we have ˆM = X R 1. In order to get ˆK we can use the following equations R = U T diagr 1, r 2,, r p )U where U Op) and r 1 r 2 r p ˆK = U T diagˆφ 1, ˆφ 2,, ˆφ p )U where ˆφ 1 ˆφ 2 ˆφ [ p n r i = 0F 1 φ i 2, 1 )] 4 diagφ2 1, φ 2 2, φ 2 p φ 1,,φ p )= ˆφ 1,, ˆφ p ) i = 1,, p The approximate solution are given by n p 1 1 )ˆφ i + 2 ˆφ i nr i when ˆφ i is small i = 1,, p p ˆφ i + ˆφ j ) 1 21 r i ) when ˆφ i is large i = 1,, p Computation of the Hypergeometric Function of a Matrix Argument We have adapted the algorithm that was presented in Koev2006) [45] paper in order to approximate the hypergeometric function of a matrix argument. The hypergeometric function of a matrix argument is scalar-valued. The hypergeometric function of a matrix argument is defined as follows. Let p 0 and q 0 be integers and let X be an n n complex or real symmetric matrix with eigenvalues x 1, x 2,, x n. Then, pf q α) a 1,, a p ; b 1,, b q ; X ) k=0 β k a 1 ) α) β a p ) α) β k!b 1 ) α) β b q ) α) β C α) β X ), where α > 0 is a parameter, β k means β = β 1, β 2, ) is a partition of k i.e. β 1 β 2 0 are integers such that β = β 1 + β 2 + = k, a) α) β i,j) β a i 1 ) α + j 1 is the generalized Pochhammer symbol and C α) X ) is the Jack function which is a symmetric, β homogeneous polynomial of degree β in the eigenvalues x 1, x 2,, x n of X. 60
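For a 1 × 1 matrix argument X = (x), only the single-part partition κ = (k) contributes, the Jack function reduces to x^k and (a)_κ to the ordinary rising factorial, so the series collapses to the classical scalar function 0F1(a; x) = Σ_k x^k / (k! (a)_k). The sketch below truncates that scalar series and checks it against the known identity 0F1(3/2; z²/4) = sinh(z)/z; the general matrix-argument case requires the Jack-function recursions of [45] discussed next.

```python
import numpy as np

def hyp0f1_scalar(a, x, m=60):
    """Truncated series for the scalar hypergeometric function
    0F1(a; x) = sum_k x^k / (k! (a)_k), keeping terms k = 0, ..., m."""
    total, term = 1.0, 1.0
    for k in range(1, m + 1):
        term *= x / (k * (a + k - 1))   # ratio of consecutive terms
        total += term
    return total

z = 1.7
print(hyp0f1_scalar(1.5, z ** 2 / 4.0))   # should match sinh(z)/z
print(np.sinh(z) / z)
```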

61 which is, The approximation of this infinite series is done by computing its truncation for β m, m p F q α) a 1,, a p ; b 1,, b q ; X ) m k=0 β k a 1 ) α) β a p ) α) β k!b 1 ) α) β b q ) α) β C α) β X ), This series: converges for any X, when p q converges if max i x i < 1 when p = q + 1 diverges when p > q + 1 β-term converges to zero as β as it converges Koev et. al. exploited recursive combinatorial relationships between the Jack functions, which allowed them to only update the value of a Jack function from other Jack functions computed earlier in the series. This method is really efficient in terms of computational complexity which is only linear in the size of the matrix argument. In a very special case, when X is a multiple of the identity, algorithm is even faster Sampling Random Matrices from Matrix Langevin Distribution on V n,p Notice that, when p = 1, the Stiefel manifold V n,1 is nothing but the unit hypersphere S n 1 in R n. This distribution [46] on unit hypersphere is termed as von-mises-fisher vmf) distribution which has a density w.r.t the uniform distribution given by: p vmf x µ, c) = c n/2 1 2π) n/2 I n/2 1 c) expcµt x) where x S n where c 0 and µ = 1 and I r denotes the modified Bessel function of the first kind and order r. Wood1994) [47] provided a straightforward rejection sampling procedure to sample a vector from vmf distribution on S n. Hoff2009) [46] discussed two sampling methods - one via rejection sampling and another using Gibbs sampling method to generate sample from Matrix Langevin ML) distribution whose density is given by: f ML X F ) = 1 0F 1 n 2, 1 4 D2 ) exptracef T X )) where X V n,p 61

62 and D is the diagonal matrix of singular values of F. Both of these methods actually based on the sampling procedure for vmf distribution on S n The Rejection Sampling Method The parameter matrix F can be decomposed via Singular Value Decomposition or SVD in F = UDV T, where U and V are n p and p p orthonormal matrices, respectively and D is diagonal matrix with positive entries. The probability density of X can be written as exptracef T X )) = exptracevdu T X )) = exptracedu T XV )). The mode of this probability density is at X = UV T and the entries in D,the concentration parameters indicate that how close a random matrix close to its mode. The mode finding problem is is equivalent to finding the nearest orthogonal matrix to a given matrix M. To find this orthogonal matrix R, one uses the singular value decomposition M = W ΦY T to write R = WY T. For the rejection sampling method, a uniform density envelop f u was used. As the mode of the ML distribution is at UV T so the distribution has the maximum density f 0 ML = exptraced)) 0F 1 n 2, 1 4 D2 ) Initially generate np number of random variables us from Normal0, 1) and arrange them in a n p matrix Z and form X = ZZ T Z) 1/2. Now Z is full rank with probability 1. As we can see from the properties of the uniform distribution on V n,p, X will be uniformly distributed. We need to generate u uniform0, 1) independent of X, now if f 0 ML u < f ML X ) = u < exptracef T X D)), then we accept X as one of our positive samples. Otherwise we will reject this X and start afresh. But this can be very inefficient according to Chikuse2003) [6]. Hoff discussed new techniques to sample from ML distribution on V n,p. First he discussed about sampling a matrix one column each time from a ML distribution whose parameter is an orthonormal matrix H. The basic algorithm is as follows: He also showed that, if the parameter matrix H has non-orthogonal columns, then the acceptance ratio will be very low for the above algorithm. So 62

63 Algorithm 6 ML sample generation - I Sample X [,1] MLH [,1] ) for r = 2 to p do Construct N r, an orthonormal basis for null space of X [,1,2,,r 1)] Construct z MLN T r H [,r] ) set X [,r] = N r z end for he discussed an second algorithm: where CH) is given by the following quantity: Algorithm 7 ML sample generation - II Do the SVD of F = UDV T and let H = UD Samples pairs {u, Y } until u < f MLY ) f u by the following method, Y )CH) Sample u unif 0, 1) Sample X [,1] MLH [,1] ) for r = 2 to p do Construct N r, an orthonormal basis for null space of Y [,1,2,,r 1)] Construct z MLNr T H [,r] ) set Y [,r] = N r z end for X = YV T CH) = { p 1 0F 1 n, D2 ) 2 n p 1)/2 Γ n p 1 ) I } n p 1)/2 H [,r] ) 2 r=1 H [,r] n p 1)/ Gibbs Sampling Method For large values of D or p, the rejection sampling method is not particularly suitable. In this case it is better to use Gibbs sampler method to generate random X from the ML distribution on V n,p. The probability density can be also written as: f ML X F ) exptracef T X )) = p r=1 expf T [,r]x,r ) As the columns are orthogonal to each other so they are not independent. Also X = {X [,1], X [, 1] } = {Nz, X [, 1 }, where z S n p+1 and N is an orthonormal basis for the null space of X [, 1]. Conditional distribution of z is given by a von-mises-fisher vmf) density: pz X [, 1] expf T [,1]Nz) = expc t 1z). 63
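The following is a minimal, illustrative sketch of the naive uniform-envelope rejection step described at the beginning of this subsection: draw X uniformly on V_{n,p} through the polar construction and accept with probability exp(trace(F^T X) - trace(D)), where trace(D), the sum of the singular values of F, is the maximum of trace(F^T X) over the manifold. As noted above, the acceptance rate deteriorates rapidly for large concentrations, which is what motivates the column-wise and Gibbs schemes of [46]; the function names here are assumptions for illustration.

```python
import numpy as np

def uniform_stiefel(n, p, rng):
    """Uniform draw on V_{n,p}: the polar part Z (Z^T Z)^{-1/2} of a Gaussian Z."""
    Z = rng.standard_normal((n, p))
    w, V = np.linalg.eigh(Z.T @ Z)
    return Z @ (V / np.sqrt(w)) @ V.T

def sample_ml_rejection(F, rng, max_tries=200000):
    """Naive rejection sampler for the Matrix Langevin ML(F) on V_{n,p}, using the
    uniform distribution as the envelope; the acceptance probability is
    exp(trace(F^T X) - trace(D)), which shrinks quickly as the singular values grow."""
    trace_D = np.linalg.svd(F, compute_uv=False).sum()   # max of trace(F^T X) over V_{n,p}
    n, p = F.shape
    for _ in range(max_tries):
        X = uniform_stiefel(n, p, rng)
        if rng.uniform() < np.exp(np.trace(F.T @ X) - trace_D):
            return X
    raise RuntimeError("no acceptance; reduce the concentration or use a Gibbs scheme")

# toy usage: mode at a random orthonormal matrix with isotropic concentration 2
rng = np.random.default_rng(2)
F = 2.0 * uniform_stiefel(5, 2, rng)
X = sample_ml_rejection(F, rng)
```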

64 This helps to write down a Markov chain for X. Let the current value of X is X i. Now in order to generate the next value X i+1), we follow the following steps: This is a reversible, Algorithm 8 Gibbs sampler method For any random k {1, 2, p} do the following steps: Choose an orthonormal basis for null space of X [, k] sample z vmf N T F [,k] X [,k] = Nz return X i+1) = X aperiodic, irreducible Markov chain if p < n. This chain converges in distribution to MLX ; F ) as i. The are couples of things to be noted: if n = p, then the chain is reducible. So two columns of X are sampled at a time to construct a irreducible Markov chain. Non-orthogonality of columns of F can give rise to a poor mixing Markov chain. In order to improve on that, Gibbs sampling is done on Y MLUD), where svd of X is UDV T and finally generate X = YV T. 64

65 CHAPTER 4 BAYESIAN ANALYSIS OF MATRIX-LANGEVIN ON THE STIEFEL MANIFOLD 4.1 Preliminaries Analysis of directional data comprises one of the major sub-fields of study in Statistics. Directional statistics deals with observations that are unit vectors, or sets or ordered tuples of unit vectors, in the n-dimensional space R n. Since the sample space is not the usual Euclidean space, standard methods developed for the statistical analysis of univariate or multivariate data do not apply immediately. Stated differently, incorporating the intrinsic structure of the sample space is essential to the proper analysis of such data. There is extensive literature on the statistics of circular and spherical data. In addition, there has been significant interest in the study of more general sample spaces such as the Stiefel and the Grassmann manifold. In particular, Downs [48], Khatri and Mardia [49], and Jupp and Mardia [50], have developed statistical methods for data that lie on the Stiefel manifold. When the orientation of an object, or some derived feature, lies in a space of non-zero curvature, the usual probability distributions can not be used to describe it. A space that is a natural fit for such data is the Stiefel manifold. A good statistical framework to infer the parameters of the probability distribution in this general sample space would clearly be of use to several areas of scientific enquiry. An appropriate marriage with Bayesian inference, which, with the ever growing computational power of the digital computer, has come of age in a wide variety of situations, would be even more beneficial. In this chapter we develop a Bayesian framework for a particular distribution known as the Matrix-Langevin henceforth denoted by ML) on the Stiefel manifold V n,p. We begin by proposing appropriate priors and deriving the posterior estimate of parameters of a ML on V n,p. We then extend the framework to a finite mixture model of ML and finally to a non-parametric Dirichlet Process Mixture DPM) model of ML which can potentially accommodate an infinite number of clusters. We also demonstrate a faster variational inference scheme for the DPM model. 65

66 4.2 Motivating Example: Dictionary Learning In a Bayesian dictionary learning framework [51], a signal X i or data) is represented as an n-dimensional vector and the over-complete dictionary D is represented as an n K matrix, with n < K, such that the signal X i is decomposed as X i = D s i + ɛ i = D [w i z i ] + ɛ i, where denotes the Hadamard product, where ɛ i is the noise and s i, the factor score. In the Bayesian context, each of the three components D, s i, ɛ i ) can be regularized using appropriate priors. In particular, due to over-completeness of D, s i must be properly regularized to ensure uniqueness of solution, and one popular Bayesian solution is to place a sparsity-promoting prior on s i. Using Hadamard product to decompose s i = [w i z i ] into the weight vector w i and the binary indicator z i that indicates the selection of columns atoms) from the dictionary D, the sparsity-promoting prior can be modeled using a Bernoulli process [51] on z i. The prior for ɛ i depends on the assumed noise model, and it is typically problem-dependent. The prior for the dictionary D is often given by a general distribution such as MVN 0, 1I n n) for each column d i i = 1, 2,, K) of D, which is a n 1) length vector. However, in signal processing applications [52], the columns of D are often assumed to be normal vectors, and this requires the prior for D to be specified as a distribution on the unit sphere in R n, the simplest kind of Stiefel manifold. More specific assumptions on the nature of the signal X i would require the prior for D to be a distribution on a general Stiefel manifold, and motivating examples come from several important recent work on compressed sensing of block-sparse signals and multi-banded signals [53 56]. In this context, the signal X i is assumed to lie in the union of a small number of low-dimensional subspaces, and this effectively partitions the columns of D into small blocks 66

67 of size p, Dj) for j = 1, 2,, M, D = d 1 d p }{{} D[1] : d p+1 d 2p }{{} D[2] : d K p+1 d }{{ K. } D[M] In this new setting, the sparsity prior is now placed at the block-level and the usual notions of coherence and sparsity [52] assume a more general form of block-coherence and block-sparsity. Since each block is low-dimensional and under-complete, uniqueness of solution within each block is assured and no regularization is required within each block. In particular, one can assume that each block Dj) is composed of orthonormal columns, and the natural prior for the block-structured dictionary D is then a distribution on an appropriately chosen Stiefel manifold. 4.3 The Stiefel Manifold and ML Distribution When p-distinguishable ordered directions in n-dimensions n p) are required to describe each orientation, [48] has given methods for summarizing and comparing orientations of samples of orientable objects. Let x j be an n 1 column vector whose elements are values given in some fixed co-ordinate system, for the j-th direction. Then the orientation of a single object may be expressed as an n p matrix X of rank p whose columns are x 1, x 2,, x p. This can be formally imposed by requiring X to satisfy X T X = C where C is a symmetric positive-definite p p matrix. The space of all such X is known as the Stiefel C-manifold. When C is assumed to be identity I p ) we have the usual Stiefel manifold. Henceforth, we shall denote this by V n,p or On, p). Informally, V n,p consists of n p tall-skinny orthonormal matrices. Our Bayesian framework uses a non-uniform distributions on V n,p known as the ML distribution. The one dimensional special case of this distribution is the von-mises distribution on a hypersphere. Based on the normalized Haar measure [dh], an exponential family of probability distribution [48] has been defined on V n,p in the following manner. The density function [6], [43] can be written as: df X ) = exptracef T X )) 0F 1 n 2, 1 4 F T F ) [dx ] 4 1) 67

68 where F is an n p parameter matrix and 0 F 1, the normalizing constant [49], is a Hypergeometric Function with a Matrix argument [57], [58], [59]. F can be decomposed into the product of two matrices, F = MK, via the Polar decomposition method. Clearly, M V n,p and K is a p p symmetric positive definite matrix. The density has a mode at X = M, where M is the polar part or orientation of F. K is the elliptical part or concentration of F. Since F = MK and K is a symmetric positive definite matrix, K can be eigen decomposed by K = UDU T where U Op) and D is a diagonal matrix diagd 1, d 2,, d p ). We can now write the hypergeometric normalizing constant as: 0F 1 n 2, 1 4 F T F ) = 0 F 1 n 2, 1 4 K T K) = 0 F 1 n 2, 1 4 D2 ) since M T M = I p as M V n,p. 0 F 1 ) thus depends only on the eigenvalues of the parameter matrix. The more general and flexible class of probability densities having linear as well as quadratic terms is f BMF X A, B, F ) exptracef T X + BX T AX )) where A and B are generally taken to be symmetric matrices. For F = 0, we get the Matrix Bingham distribution and for A or B = 0 we get the ML distribution. In this paper we restrict ourselves to the ML distribution. However, it is not difficult to see that our technique extends to the more general family of distributions. 4.4 Parametric Bayesian Inference for the ML Distribution As described above, the parameter F has two distinct components M polar) and K elliptical or concentration). They play very different roles in giving shape to the underlying distribution. M is responsible for pure rotation whereas K is responsible for concentrating the distribution around the mode M. We will assume that the prior distributions for M and K are independent. Note next that M, being a unitary matrix itself, lies on the Stiefel. Also, since it does not occur in the denominator of ML, its inference is likely to be simpler. In contrast, the inference for K, which is a positive-definite matrix is not likely to be straightforward since we are confronted with the Hypergeometric Function on K in the denominator of ML. The 68

69 primary contribution of this work is that we have successfully constructed a full Bayesian framework for both parameters, which [1] could not accomplish. Conjugate priors have been widely used to incorporate prior beliefs seamlessly into the Bayesian framework. The choice of the conjugate prior depends heavily on the form of the likelihood function. Our assumption that the prior distributions for M and K are independent is at odds with the ML distribution where the likelihood is determined by a nonfactorizable function of their product. As a result, the posterior is not factorizable like the prior, and hence the prior is not conjugate in general. However if we were to fix either one of the parameters, the conditional distribution of the other parameter can still be made conjugate to the likelihood which in the literature is known as a Conditionally conjugate model. We follow this path Likelihood for the ML Distribution Let N samples of data X N = {X 1, X 2,, X N } be generated i.i.d. from the ML distribution on V n,p with parameter matrix F = MK. MLX i ; F ) = exptracef T X i )) 0F 1 n 2 ; 1 4 F T F ) 4 2) Let R = N X i. Using K T = K, M T M = I p and diagonalizing K = UDU T, we have F T F = D T D = D 2. The complete data likelihood is LX N ) = exptracekm T R)) N 0F 1 n 2 ; 1 4 D2 ) = exptracert MK)) 0F 1 n 2 ; 1 4 D2) N 4 3) Prior for the Polar Part M Since M lies on V n,p, we assume another ML with hyper-parameter G 0 as the prior distribution for M. MLM; G 0 ) = exptraceg T 0 M)) 0F 1 n 2 ; 1 4 G T 0 G 0) G 0 = M 0 Q 0 where Q 0 = λ I p and λ R 69

70 4.4.3 Posterior for the Polar Part M sampler. We now compute the full conditional distribution for M, which will be used in the Gibbs Now we have, PM X i, K) = PX i M, K) PM) PK) PX i K)PK) N ) T PM X N, K) = ML X i K + G 0 M 4 4) Thus we get a conditionally conjugate prior for M for the given likelihood Prior for the Elliptical or Concentration Part K We first assume that K is a diagonal matrix D. This assumption is motivated primarily by the simplicity it confers to calculations. We extend our results to a more general K in a later section. Observing the likelihood, we propose a prior distribution, which is proportional to ΠD; α, Σ) exptraceσd)) 0F 1 n 2 ; 1 4 D2 ) ) α 1D [0, 1] p ) where hyperparameters α > 0 and Σ R p p. 1D [0, 1] p ) denotes the indicator random variable with diagonal entries; the eigenvalues are bounded above by 1 and below by 0. The framework remains unchanged when the eigenvalues are bounded above by any t > 0. The ML distribution is defined for a particular K = D. Since the Stiefel manifold is compact, the integral of ML w.r.t X i is bounded and hence a finite normalizing constant exists. However, K lies on an unbounded cone which is not compact the PSD cone). It is therefore unlikely to have a proper posterior if we assume the support of the prior to be unbounded, since the integration w.r.t K might not yield a finite normalization constant. Thus the need to bound K in some manner. Note in contrast that the prior for M does not suffer from this issue since M lies on the Stiefel which is compact Upper and Lower Bounds for the 0 F 1 ) Function A partition is a vector κ = κ 1, ; κ p ) of non-negative integers that are weakly decreasing: κ 1 κ 2 κ p. The entries κ 1,, κ p are called the parts of κ; the length 70

71 of µ is the number of non-zero κ j ; and the weight of κ is κ := κ 1 + κ κ p. For a Cor R) and any non-negative integer j, the rising factorial, a) j is defined as a) j = Γa + j) Γa) = aa + 1)a + 2) a + j 1) Corresponding to each partition κ, the partitional rising factorial, a) κ is defined as a) κ = p a 1 2 j 1)) κ j Let S be a real symmetric p p matrix. For each partition κ, we denote by Z κ S) the zonal polynomial of the matrix S. Let a Cor R) be such that a + 1 j 1) is not a non-negative 2 integer for all j = 1, p. For any symmetric p p matrix S, we define a generalized hypergeometric function of matrix argument, 0F 1 a; S) = k=0 1 k! κ =k Z κ S) a) κ = E say) 4 5) where the inner summation is over all partitions κ = κ 1,, κ p ) of weight k. Also, Z κ S) = traces)) k. 4 6) κ =k It follows from equations 4 5 and 4 6 that 0F 1 a; S) < 2 exptraces)) A Lower Bound From p a) κ = a j ) 2 κ j [ ] [ = a a + κ 1 1) a 1 2 ) a 1 ] 2 + κ 2 1) [ a p 1 2 ) a p 1 ] + κ p 1) 2 4 7) 71

72 we note that the product is maximized when the partition of k looks like {κ 1 = k, 0,, 0}. So a) κ aa + 1) a + k 1). Hence E > 1 1 Z κ S) 4 8) k! aa + 1) a + k 1) k=0 κ =k Now using the fact that the Arithmetic MeanAM) Geometric MeanGM), we have a a + k 1)) 1 a + + a + k 1) k k = a a + k 1)) a + k 1 ) k 2 From Inequality 4 8 and using κ =k Z κs) = T k where T = traces)), we have E > Note that, in our case a = p 2 1 k! k=0 T k a + k 1 2 ) k = 1 k! k=0 and p 1, so 2a 1 and k 1 T /a) k ) 1 + k 1 k 4 9) 2a 1 2a 1 = k 1 2a k 1) = 1 + k 1 2a k 4 10) From equation 4 9 and 4 10 and noting that the first term of the RHS is 1, we have E > 1 + k=1 1 T /a) k 4 11) k! k k We now use the Stirling s formula 2π k k k e k k! e k k k e k to bound k k 1 T /a) k ke k E > 1 + k! k k ke k=1 k 1 T /a e) k k = 1 + k! k k ke k=1 k T /a e) k k 2π k! k! k=1 1 + T /a e) k 2π k!) 2 = 1 + 2π k=1 T / a e) 2k k=1 k!) 2 72

73 = 1 + 2π 1 + 2π k=1 k=1 η 2 T ) 2k k!) 2 where η = 2 a e η 2 )2k T k k!) 2 where η = 2 a e 4 12) We know T = traces) = traced 2 ) and as D is diagonal and all its eigenvalues d 1, d 2,, d p ) are positive, we have T k = traced 2 ) ) k = p d 2 i ) k p d i ) 2k So by 4 12, we have E > 1 + 2π = 1 + 2π [ η 2 )2k p d ] i) 2k k!) 2 k=1 [ p ] η d i 2 )2k k!) 2 k=1 4 13) From the definition of the modified Bessel function of order zero I 0 x)) we have So, we have: [ k=1 ] η d i 2 )2k = I k!) 2 0 ηd i ) 1 E > 1 + 2π = 1 + 2π p I 0 ηd i ) 1) where η = 2 a e p I 0 ηd i ) p 2π 4 14) Figure 4-1 presents this lower bound for random choices of eigenvalues of D. Figure 4-1. Lower bound for 0 F 1 a; S) [in red] by RHS of equation 4 14 [in blue]. x-axis represents the sum of eigenvalues and y-axis denotes the function values. 73
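The step from the series to Equation 4-14 relies only on the classical expansion I_0(x) = Σ_{k≥0} (x/2)^{2k} / (k!)², so that the tail Σ_{k≥1} (x/2)^{2k} / (k!)² equals I_0(x) - 1. A quick numerical confirmation of that identity, assuming SciPy for the Bessel function:

```python
import numpy as np
from scipy.special import iv

def bessel_tail(x, m=60):
    """sum_{k=1}^{m} (x/2)^{2k} / (k!)^2, the series appearing in the bound."""
    total, term = 0.0, 1.0
    for k in range(1, m + 1):
        term *= (x / 2.0) ** 2 / k ** 2      # ratio of consecutive terms
        total += term
    return total

for x in (0.3, 1.0, 2.5):
    print(bessel_tail(x), iv(0, x) - 1.0)    # the two columns agree
```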

74 Lower Bounds for I 0 x) x 0, as Note first that I 0x) = I 1 x). Observe next that expx) I 0 x) is an increasing function for all ) expx) = expx) I 0x) I 1 x)) I 0 x) I0 2x) > ) Applying I 0 x) > I 1 x), from equation 4 15, we have f x) = expx) I 0 x) so x > b f x) > f b) = = I 0 x) < expx) I 0b) expb) I 0 x) > x, in order to check it let us write the expression for I 0 x) is an increasing function, 4 16) I 0 x) = x/2) 2k k=0 k!) 2 = 1 + x/2) 2 + Ox 4 ) now x/2 1) 2 > 0 = 1 + x/2) 2 > x, so we have I 0 x) > x. Let us write an important identity of modified Bessel functions x > 0) [it is often called backward recurrence relation] I ν 1 x) I ν+1 x) = 2ν x I νx) 4 17) Also from [60] we have Turan type of inequality for modified Bessel function I ν 1 x)i ν+1 x) < I 2 ν x) for x > 0 and ν ) From equations 4 18 and 4 17 and putting ν = 1 [ I1 2 x) > I 0 x)i 2 x) = I 0 x) I 0 x) 2 ] x I 1x) = I 2 1 x) > I 2 0 x) 2 x I 0x)I 1 x) = y x y 1 > 0 where y = I 1x) I 0 x) 4 19) 74

75 4 As the discriminant = + 4 > 0, so it has two real roots. Also note that y > 0 by the x 2 definition of modified Bessel function. So, in order to hold inequality 4 19, y should be greater than the larger positive root r 1, where r 1 is given by x x r 1 = 2 = x + x + 1 = 1 2 x x 2 ) 4 20) So we have y > r 1 = I 1x) I 0 x) > 1 x x 2 ) = x I 1 x) > 1 + x 2 1) I 0 x) > x 2 1) I 0 x) = x I 1 x) > x 1)] I 0 x) 4 21) Let us take gx) = expx) x I 0 x) so g x) = expx) [x 1) I 0x) x I 1 x)] x 2 I 0 x)) 2 Now, g x) < 0 as from 4 21, we have x 1) I 0 x) x I 1 x) < 0. so, gx) < gb) when x > b and using this we get, expx) x I 0 x) < expb) b I 0 b) = I 0x) > expx) x b I 0 b) expb) 4 22) Now, I 0 b) > 1, so we have, I 0 x) > expx b) x/b) = expx lnx/b) b) > expax b) for some 0 < a < 1, such that x lnx/b)) > ax. So, we have: when x > b, I 0 x) > expax b) for some 0 < a < ) 75

76 Remarks When x > b we have a bound for I 0 x) > expax b). Clearly it depends on choice of b. From [61] we know We use generic constants c 1 and c 2 instead, I 0 x) 1 2 exp2x π ) exp 2x π ) 1 2 exp2x π ) I 0 x) > expc 1 x c 2 ) 4 24) Lower Bound for 0 F 1 ) Using Lower Bound for I 0 x) Now from equation 4 14 and 4 24 and noting that S = 1 4 D2 and a = n 2 we have, E 1 = 0 F 1 n 2 ; 1 4 D2 ) > 1 p 2π + 2π 1 p 2π + 2π p p I 0 η d ) i 2 expc 1 η d i 2 c 2) 4 25) we also know, from the AM GM inequality that 1 p = p expy i ) exp p y i ) p 1 expy i ) p exp p ) 1 p ) p y i 4 26) putting y i = c 1 η d i 2 c 2) and η = 2 a e = 2 2 n e, using the above inequality in equation 4 25, we can write Setting Z = c 1 2 p n E 1 > 1 p 2π + p 2π exp 1 p e d i c 2 ), we have, c 1 2 p n 1 e ) p d i c 2 0F 1 n 2 ; 1 4 D2 ) > 1 p 2π + p 2π expz) 4 27) > expz) 4 28) 76

77 Notice that, Z = c 1 2 p n 1 e traced) c 2 ) = c 3 traced) c 2. Where c 3 = c 1 2 p 1 n e. Using numerical simulations, when 0 x 1 we see Figure 4-2) that I 0 x) > expx 0.77). So in our particular case c 1 = 1, c 2 = 0.77 and c 3 = 2 p n 1 e. Figure 4-2. Lower bound for I 0 x) [in red] by expx 0.77) [in blue]. Note that the inequality I 0 x) > expx 0.77) holds only in the interval [0, 1] Posterior for the Elliptical or Concentration Part D Figure 4-3. This is a approximate profile for posterior density function for a 2 2 diagonal matrix when 100 data points are given Note that the prior for D looks like, ΠD; α, Σ) = 1 C 0 exptraceσd)) 0F 1 n 2 ; 1 4 D2 ) ) α 1D [0, 1] p ) 4 29) C 0, the normalization constant, is finite since the support of the distribution is compact. The data likelihood term is LX N ) = exptraceht D)) 0F 1 n 2 ; 1 4 D2) N p ) where H = M T X i 4 30) 77

78 From equation 4 29 and 4 30 we can write the full conditional distribution for D as, ΠD X N, M) = 1 C exptraceh T D + ΣD)) 0F 1 n 2 ; 1 4 D2) N+α 1D [0, 1] p ) < 1 exptraceh T D + ΣD)) 1D [0, 1] p ) C expz)) N+α = 1 exptraceh T D + ΣD)) 1D [0, 1] p ) C expn + α)z)) = 1 exptraceh T D + ΣD)) C expλ 1 traced) λ 2 ) 1D [0, 1]p ) where λ 1 = N + α) c 1 2 p 1 n e and λ 2 = N + α) c 2. = 1 C exptraceht + Σ λ 1 )D)) e λ 2 1D [0, 1] p ) = 1 C exptraceγd)) eλ 2 1D [0, 1] p ) 4 31) where Γ = H T + Σ λ 1 I p, λ 1 = N + α) c 1 2 p 1 n e, λ 2 = N + α) c 2, and C is the appropriate normalization constant. Numerical integration method is used to calculate C for sampling from the posterior of D Rejection Sampling As we have Γ 11 Γ 1p d 1 0 Γ p1 Γ pp = 0 d p Γ 11 d 1 Γ pp d p 78

79 We are not interested in the off diagonal entries. Now clearly, traceγ D) = p Γ ii d i = p β i d i say) So from equation 4 31, we have ΠD X N, M) < 1 C = 1 C e λ 2 ) p ) exp β i d i 1D [0, 1] p ) [ e λ 2 ) p ) ] 1 W W exp β i d i 1D [0, 1] p ) Where W = p 1 where the function gd i ) = density function as, 0 e β i d i ) dd i ) = β i e β i 1 p [ e β i d i ) β i ] 1 0 = p e β i ) 1 ) exp β i d i ) 1d i [0, 1]) for all i = 1, 2,, p is a proper 1 0 gd i ) dd i ) = 1 So the envelop distribution for the rejection sampling is the product of independent distribution having density for each i = 1, 2,, p β i Hence the posterior is gd i ) dd i ) = β i e β i 1) eβ i d i dd i ) 1d i [0, 1]) Where Q = 1 C e λ 2 ) p ΠD X N, M) < 1 e λ 2 ) p e β i ) ) [ ] 1 p gd i ) dd i ) C β i [ p ] = Q gd i ) dd i ) )) e β i 1 β i and the other part is a valid probability density function. Q is the acceptance-rejection ratio which depends on the value of C. This distribution is a 79

80 truncated exponential distribution and since we have a compact support, the integral is finite as long as each β i <. The truncated exponential CDF would be when each d i lies between 0 and 1) F d i ) = β di i e β i d i dd e β i ) = eβ i d i 1) i 1) 0 e β i 1) We use the inverse CDF transform sampling to generate i.i.d. samples from this distribution. First we sample a number u from uniform[0, 1] and then use the following equation to generate a sample from the above truncated exponential distribution in the [0, 1] range, which is our envelop distribution. u = eβ i d i 1) e β i 1) = d i = 1 β i log1 + u e β i 1)) 4 32) Thus we can sample a D by sampling all of its eigenvalues d i i = 1, 2,, p) Metropolis-Hastings MH) Sampling Scheme for D When we have a diagonal D, we can easily instrument a Markov chain so that the chain converge to the correct stationary distribution. Note that MH sampler does not give i.i.d samples as the samples are correlated. In order to remove the correlation thinning is employed and a large burn-in iteration of samples has to be discarded. In our case when the dimension of D is small then MH sampler is quite efficient. We have use a product of independent beta distributions as a proposal distributions. For a very general K, we can use Wishart as a proposal distribution Hybrid Gibbs Sampling manner, A 2-step Gibbs sampling method with a rejection sampler can be set up in the following 1. Initialize D to any random diagonal matrix where all entries are sampled from [0, 1]. 2. Sample M from the full conditional distribution of M given in Sample D from the full conditional distribution of D by the above MH or Rejection sampling scheme. 80

81 4. Repeat 2) and 3), until M and D converge to the appropriate stationary distribution Experiments on Simulated Data We have run simulations with various values for n and p. We report here our results for n = 5, p = 3. We used 1000 Gibbs iterations as our burn-in period. We ran the chain for another 1000 iterations after the burn-in. Since the samples from the MCMC are correlated, we picked every 20-th sample thinning). For simplicity of calculations, we did not put any distribution on the hyperparameters. For each simulation we simply used different values of α and Σ. The distance between the predicted M pred and M orig was calculated using the metric µ d = [ p tracempred T M orig) ]. If M pred were close to M orig, the metric would return a small value. For various experiments for n = 5 and p = 3, we got µ d values in the range 0.19 to For D, we give two typical examples. In one experiment we had D orig = diag0.8, 0.8, 0.8) and the simulation result predicted D pred = diag0.91, 0.79, 0.83). In another example we had D orig = diag0.8, 0.9, 0.7) and the predicted D pred = diag0.84, 0.91, 0.81). These results prove that the inference was satisfactory. For other choices of n and p we had similar results. Rejection sampling algorithms are often criticized for their slow convergence due to high rejection rates. This was partially true in our case, with some examples showing high rejection rates, and others not. Since the rejection sampling is done within the hybrid Gibbs sampler, the rejection rates were data dependent. A better data-adaptive scheme will likely improve our convergence rates Extension of the Model to a More General K Using K T = K, M T M = I p and diagonalizing K = UDU T, we have F T F = D T D = D 2. In the previous section, we assumed that K was diagonal. We now show how K can be generalized to a symmetric positive definite matrix all of whose entries lie between 0 and 1, i.e., for all i, j = 1, 2,, p we have K ij [0, 1]. Note the following: tracek 2 ) = tracek T K) = p p K ij ) 2 81

82 and this reduces to p K ii) 2 when K is diagonal. Careful inspection shows that equation 4 12 can also be written for a general K, E > 1 + 2π 1 + 2π = 1 + 2π η )2m p p 2 K 2) m ij m!) 2 η )2m p p 2 K ij) 2m) m=1 m=1 p p m=1 m!) 2 η Kij 2 ) 2m m!) 2 So using same inequalities involving modified Bessel function of order zero we have using AM GM). Hence 0F 1 n 2 ; 1 4 D2 ) > 1 + 2π = 1 p 2 2π + 2π = 1 p 2 2π + 2π p p p p = 1 p 2 2π + p 2 2π exp p p I 0 η K ) ij 2 I 0 η K ) ij 2 exp c 1 η K ) ij c 1 2 p 1 2 n e 2 c 2 p ) 1 ) p K ij c 2 where D is now a diagonal matrix containing eigenvalues of K and we know n 0F 1 2 ; 1 ) n2 4 K T K = 0 F 1 ; 14 ) D2. Also note that the entries of K are positive so p Setting Z = c 1 2 p 2 n 1 e p p K ij c 2 ), we have, p K ij > p K ii = tracek). 0F 1 n 2 ; 1 4 D2 ) > 1 p 2 2π + p 2 2π expz) > expz) We get a very similar expression for the posterior distribution ΠK X N, M) by replacing D by K. Note that 0 F 1 ) depends only on the eigenvalues of K. For the envelop distribution we 82

now have a total of (1/2) p(p + 1) independent truncated exponential distributions on [0, 1] (as K is symmetric), and we can perform the rejection sampling as before by carrying out all the integrations over the region [0, 1]^{p(p+1)/2}.

Log-convexity of the Hypergeometric Function

In this setup we take D to be a general (not just diagonal) p × p matrix. In the Matrix-Langevin density function we have 0F1(n/2; D^2/4) as the denominator, which is the hypergeometric function with general matrix argument D; all eigenvalues of D are positive because D is positive definite. So let us first take p = 2 and denote the eigenvalues of D by r and s (r, s > 0). Also write f(r, s) = log{0F1(n/2; D^2/4)}. The Hessian matrix H is

H = [ ∂^2 f(r,s)/∂r^2     ∂^2 f(r,s)/∂r∂s
      ∂^2 f(r,s)/∂s∂r     ∂^2 f(r,s)/∂s^2 ].

Now, in order to show log-convexity of the hypergeometric function, i.e., convexity of f(r, s), we need to show that H is positive definite.

A solution

Let X ∈ R^k be a k-dimensional random vector which belongs to the exponential family, and suppose that the probability density function of X is of the form f(x) = exp(v^T x − c(v)), x ∈ S, where S is the sample space (i.e., the range of possible values) of X, the vector v ∈ R^k is the corresponding natural parameter, and c(v) is the log of the usual normalizing constant. Because f is a density function, ∫_S f(x) dx = 1, so it follows immediately that exp(c(v)) = ∫_S exp(v^T x) dx, or c(v) = log ∫_S exp(v^T x) dx.

Next, we want to calculate the moment-generating function of X. In the case of the Matrix-Langevin, the moment-generating function exists for all t ∈ R^k, and Σ, the covariance matrix of X, is strictly positive definite. So, for a k-dimensional vector t, we want to calculate M(t) := E exp(t^T X). It follows easily from the above formulas that, at least for all t in a suitably small neighborhood of the origin,

M(t) = E exp(t^T X) = ∫_S exp(t^T x) f(x) dx = exp(−c(v)) ∫_S exp((t + v)^T x) dx = exp(c(t + v) − c(v)).

Therefore, we obtain log M(t) = c(t + v) − c(v). We know that log M(t) is called the cumulant-generating function of X. To calculate the covariance matrix of X, we simply have to differentiate log M(t) twice with respect to t and evaluate the derivatives at t = 0. To be precise, if t = (t_1, ..., t_k), then

Σ = ∇_t^2 log M(t) |_{t=0} = ( ∂^2 log M(t)/∂t_i ∂t_j )|_{t=0},  i, j = 1, ..., k.

Now, for i, j = 1, ..., k, let us define the functions c_ij(t) = ∂^2 c(t)/∂t_i ∂t_j. Then, by applying the earlier formula for log M(t) in terms of c(v), we obtain

∂^2 log M(t)/∂t_i ∂t_j = ∂^2 [c(t + v) − c(v)]/∂t_i ∂t_j = c_ij(t + v);

therefore, evaluating at t = 0, we have arrived at a rather neat formula:

∂^2 log M(t)/∂t_i ∂t_j |_{t=0} = c_ij(v);  that is,  Σ = ( c_ij(v) ).

And finally, because Σ is positive definite, the matrix (c_ij(v)) is also positive definite; i.e., the matrix

( ∂^2 c(v)/∂v_i ∂v_j ) = ( ∂^2/∂v_i ∂v_j  log ∫_S exp(v^T x) dx )

is positive definite. In the case that we want, we simply take X to be the collection of entries of the Matrix-Langevin random matrix (viewed as a long column vector). And since we have the explicit formula for c(v) in terms of the hypergeometric function of matrix argument, we have the positive-definiteness condition that we were looking for. This p = 2 case can be generalized similarly using the same technique.

Possible ARS Sampling

The above proof of log-convexity makes the posterior distribution log-concave, and according to our earlier discussion it should enable us to implement an ARS sampling scheme. However, the ARS scheme was traditionally developed for the univariate case; in order to apply it in a multivariate setting, we need to modify the algorithm so that it remains computationally feasible. One possible algorithm is to compute the envelope hyperplanes on a grid and not look for the actual intersection points, because that would carry a huge computational burden. Instead, we take the minimum hyperplane on a particular grid cell and carry out the integration. With the help of an octree (and higher-dimensional analogue) type of data structure, this can be implemented easily.

4.5 Finite Mixture Modeling

We can now extend our framework seamlessly to a finite mixture model. Noting that if there exists a sampling scheme for the posterior of the ML distribution, the extension will be

standard, we simply state the generative model for lack of space. Here, we assume knowledge of the number of mixture components, L. The mixture weights and parameters have prior distributions, and the weights are typically viewed as an L-dimensional random vector drawn from a Dirichlet distribution. The full generative model is

N = number of observations
L = number of mixture components
X_i = i-th observation
M_i = polar part of the parameter of the distribution of observations associated with the i-th component
K_i = elliptical part of the parameter of the distribution of observations associated with the i-th component
φ = {φ_i}_{i=1}^L = prior probability of the i-th component, with Σ_{i=1}^L φ_i = 1
z_i = component of observation i

φ ∼ Dirichlet(β_1, β_2, ..., β_L)
z_i ∼ Categorical(φ),  i = 1, 2, ..., N
M_i | M_0, K_0 ∼ ML(M_i; M_0, K_0)
K_i | α, Σ ∼ Π(K_i; α, Σ)
X_i ∼ ML(X_i; M_{z_i}, K_{z_i})

4.6 Infinite Mixture Modeling

A variety of non-parametric Bayesian methods have become standard tools for modeling infinite mixtures. One of the more important examples in reference to NP-Bayes is the Dirichlet

Process (DP) mixture, or DPM. The DPM was introduced by Antoniak [29] and has seen great popularity in recent times. Once again, due to space constraints, we only briefly describe DPM modeling as a generalization of the finite mixture model. We give details of one particular method used in the inference of the DPM model, approximate variational inference, in the supplementary material. Since inference based on MCMC would follow along the lines of (an appropriately adapted version of) the auxiliary variable algorithm of Neal (2000) [38], we only briefly describe it here. For the approximate variational inference we have used the conjugate gradient method on the Stiefel manifold, and details are presented in the supplementary material.

DPM Modeling on the Stiefel Manifold

From the basic DPM model, we have the following equations:

G | α, G_0 ∼ DP(α, G_0)
θ_i | G ∼ G,  i = 1, 2, ..., N
X_i | θ_i ∼ F(θ_i),  i = 1, 2, ..., N

In our case, F is the ML distribution on the Stiefel manifold. The full generative model can be written as follows:

v_i | α ∼ Beta(1, α),  i = 1, 2, ...
Draw η_i | G_0 ∼ G_0,  i = 1, 2, ...
For the n-th data point, n = 1, 2, ..., N:
    Draw z_n | v_1, v_2, ... ∼ Multinomial(π(v))
    Draw X_n | z_n ∼ ML(X_n | η_{z_n})

G_0 is the base distribution for the Dirichlet process and is defined on the product space of the parameters of the ML distribution. X_n is also drawn from the ML distribution, with component-specific parameters. π(v) is the vector of stick-breaking weights; in the original construction this vector has infinite length. Note that, for the variational inference implementation, we have taken this to

be of a fixed length, T. This is called the truncated stick-breaking process in the literature. Correspondingly, the set {η_1, η_2, ..., η_T} are the atoms representing the components of the mixture distribution. So G can be written as

G = Σ_{i=1}^∞ π_i(V) δ_{η_i},  where  π_i(V) = v_i Π_{k=1}^{i−1} (1 − v_k).

The corresponding graphical model for the DPM is given in Figure 4-4.

Figure 4-4. Graphical model for variational inference of the DPM

MCMC Inference Scheme

We can apply the earlier sampling techniques for sampling M and D and combine them with the usual sampling techniques for the DPM model.

Variational Bayes Inference (VB) on the Stiefel Manifold

The basic idea of VB on the Stiefel manifold is similar to the method proposed in [42]. In the DPM context the latent variables are W = (V, η, z). The hyperparameters θ are the scaling parameters and the parameters of the conjugate base distribution, which is Matrix-Langevin in our case: θ = {α, λ}. We will denote the Matrix-Langevin distribution on the Stiefel manifold V_{n,p} for X with parameter F by L(X, F),

L(X, F) = exp(trace(F^T X)) / 0F1(n/2; F^T F/4).     (4-33)
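As a small aside, the unnormalized part of (4-33) is straightforward to evaluate numerically; the NumPy sketch below is our own illustration (the matrix-argument 0F1 normalizing constant is deliberately not computed), writing the parameter as the product F = MK used throughout this section.

import numpy as np

def ml_log_density_unnormalized(X, M, K):
    # Matrix-Langevin density on V_{n,p} up to its normalizing constant:
    # log exp(trace(F^T X)) = trace(F^T X), with F = M K
    # (M: n x p with orthonormal columns, K: p x p scaling part).
    F = M @ K
    return np.trace(F.T @ X)

# The polar part M maximizes trace(F^T X) over the manifold, so a random
# point on V_{n,p} (obtained here via a QR factorization) scores lower.
n, p = 5, 2
M = np.linalg.qr(np.random.randn(n, p))[0]
K = np.diag([3.0, 1.5])
X_random = np.linalg.qr(np.random.randn(n, p))[0]
print(ml_log_density_unnormalized(M, M, K), ml_log_density_unnormalized(X_random, M, K))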

Here F is an n × p parameter matrix which can be seen as the product of two matrices, F = MK, where M, the polar part, is the mode of the distribution and K, the scaling part, is a p × p matrix. M is again an n × p matrix and also lies on V_{n,p}. Note that we can write down the unique singular value decomposition of F as

F = Γ Λ Θ^T,  where Γ ∈ V_{n,p}, Θ ∈ O(p), Λ = diag(λ_1, λ_2, ..., λ_k), λ_1 ≥ ... ≥ λ_k ≥ 0.

The expectation of X on the Stiefel manifold is given by E_ST(X) = FR, where R is the p × p matrix whose entries (R_ij) are given by

R_ij = 2 ∂ log 0F1(n/2; H/4) / ∂H_ij,  where H = F^T F.     (4-34)

From now on, we will denote the normalizing constant of the distribution by c(F). Note that 0F1 is the hypergeometric function of one matrix argument. Using equation 2-2, we can obtain a lower bound on the log marginal probability of the N data points,

log p(X | α, λ) ≥ E_Q[log p(V | α)] + E_Q[log p(η | λ)] + Σ_{n=1}^N ( E_Q[log p(z_n | V)] + E_Q[log p(X_n | z_n)] ) − E_Q[log Q_ν(V, η, z)].     (4-35)

The fully factorized distribution Q can be written as

Q_ν(V, η, z) = ( Π_{t=1}^{T−1} Q_{γ_t}(v_t) ) ( Π_{t=1}^{T} Q_{τ_t}(η_t) ) ( Π_{n=1}^{N} Q_{φ_n}(z_n) ),     (4-36)

where T is the truncation level for the variational distribution, which can be appropriately set. In our analysis, the Q_{γ_t}(v_t) are beta distributions, the Q_{τ_t}(η_t) are Matrix-Langevin distributions, and the Q_{φ_n}(z_n) are multinomial distributions. The parameter set ν is the set {γ_1, γ_2, ..., γ_{T−1}, τ_1, τ_2, ..., τ_T, φ_1, φ_2, ..., φ_N}.
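In code, the variational parameter set ν reduces to a few arrays; the sketch below (our own names, with random initialization as one arbitrary choice) merely sets up the containers corresponding to equation (4-36).

import numpy as np

def init_variational_params(N, n, p, T, alpha, rng=None):
    # gamma: (T-1) x 2 Beta parameters for the Q_{gamma_t}(v_t) factors
    # tau:   T location matrices on V_{n,p} for the Matrix-Langevin factors
    # phi:   N x T responsibilities (rows sum to 1) for the multinomial factors
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.column_stack([np.ones(T - 1), alpha * np.ones(T - 1)])
    tau = [np.linalg.qr(rng.standard_normal((n, p)))[0] for _ in range(T)]
    phi = rng.dirichlet(np.ones(T), size=N)
    return gamma, tau, phi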

90 Note that each γ t has two components γ t,1 and γ t,2 and each φ n has T components φ n,t where t 1, 2,, T. Coordinate Ascent Algorithm for Optimization. The basic idea is to come up with an explicit coordinate ascent algorithm which in turn maximize the R.H.S of equation 4 35 w.r.t all of the variational parameters. So the function will be maximized w.r.t the parameters one by one and we will thus attain the minimum KL divergence between intractable posterior and the fully factorized variational distribution Matrix-Langevin distributions In this analysis we are using several Matrix-Langevin distributions, one is in the data generation part and other are used as prior for the parameter for data generation. They are given as: px n η) = cg) exptracegη T x n )) 4 37) pη λ) = cd) exptracedλ T η)) 4 38) Qη τ) = ce) exptraceeτ T η)) 4 39) Where cg),cd) and ce) are appropriate normalizing constants as mentioned in Equation On a separate note, this given framework can be easily extended by placing a gamma prior on α. In this case, α Gammaa 1, a 2 ) and the corresponding variational parameter is Gammaρ 1, ρ 2 ) Update equation for γ t Recall that T is the truncation level for variational distribution for DP. From equation 4 35, in order to get update equation for γ t, we need to collect all the terms which involves γ t only. Let Fγ t,1,γ t,2 ) denotes those terms. Thus rearranging the terms we can write, F γ t,1, γ t,2 ) = E Q [log pv α)] E Q [log Q ν V, η, z)] + }{{} i) N E Q [log px n z n )] n=1 } {{ } ii) 4 40) 90

91 Clearly, i) is nothing but the -KLQV ) pv )) and both the distributions are Beta distributions with parameters γ t,1, γ t,2 ) and 1, α), respectively. Using the standard formula for i) and using φ n,t for Qz n = t) in ii), we have: F γ t,1, γ t,2 ) = T 1 t=1 [ log Bγ t,1, γ t,2 ) + 1 γ t,1 )ψγ t,1 ) + α γ t,2 )ψγ t,2 ) B1, α) ] 1 + α γ t,1 γ t,2 )ψγ t,1 + γ t,2 ) N T T ) + φ n,j E Q [log1 v t ] + φ n,t E Q [log v t ] n=1 t=1 j=t+1 where, Bx, y) is the beta function defined as Γx)Γy) Γx+y) ψx) is the digamma function which is defined as d log Γx). dx where Γ ) is the gamma function and Note that, E Q [log1 v t ] = ψγ t,2 ) ψγ t,1 + γ t,2 ) and E Q [log v t ] = ψγ t,1 ) ψγ t,1 + γ t,2 ). Now differentiating F, first w.r.t to γ t,1 and then w.r.t γ t,2 and equating to zero, we have the update equations for γ t. γ t,1 = 1 + γ t,2 = α + N N n=1 T n=1 j=t+1 φ n,t φ n,j 4 41) Update equation for τ t First remember that τ t lies on Stiefel manifold. So we have to do the optimization on Stiefel manifold itself. We will be using Conjugate-Gradient CG) method in order to find the maxima on the manifold. The terms involving τ t are given as below: F τ t ) = E Q [log pη t λ)] E Q [log Qη t τ t )] + }{{} i) n E Q [log px n z n )] n=1 } {{ } ii) 4 42) As before, i) is nothing but the -KLQη t ) pη t )) which is = ST manifold log ) Lηt ; λd) Lη t ; τ t E)dη t Lη t ; τ t E) 91

92 = C tracedλ T η t Eτt T η t )Lη t ; τ t E)dη t ST manifold = C{traceDλ T Eτt T ) η t } 4 43) Where C = log cd) ce) ). As trace is a linear operator so we can write the equation 4 43 with η t and η t = ST manifold η t Lη t ; τ t E)dη t = E ST η t ) = τ t ER 4 44) where R is given by equation 4 34 with H = E T τ T t τ t E = E T E as τ T t τ t = I k. Equation 4 43 becomes C{traceDλ T τ t ER) tracee 2 R)} and as tracee 2 R) does not depend on τ t, we can safely ignore that part while computing gradient w.r.t τ t. After rearranging the first part becomes CtraceA T τ t )) where A = ERDλ T. Similarly from ii), we have for a particular t, N n=1 φ n,t{log cg) + traceg η T t x n )} After removing the term that does not depend on τ t, we have N φ n,t traces T τ t ) where S = x n ERG T 4 45) n=1 So from i) and ii) the linear function that we have to maximize on Stiefel manifold will look like, F τ t ) = CtraceA T τ t )) + N φ n,t traces T τ t ) 4 46) This linear function of τ t can be maximized on Stiefel manifold by CG method as described in Edelman [7]. n= CG for minimizing F τ) on the Stiefel manifold given τ 0 s.t τ0 T τ 0 = I k, compute G 0 = F τ0 τ 0 Fτ T 0 τ 0 and set H 0 = G 0 For k = 0, 1, 2, Minimize F τ k t)) over t where τ k t) = τ k Mt) + QNt) 92

93 QR is the compact QR decomposition of I τ k τk T )H k, A = τk T H k, and Mt) and Nt) are p-by-p matrices given by the 2p-by-2p matrix exponential ) ) ) Mt) A R T Ip = exp t Nt) R 0 0 Set t k = t min and τ k+1 = τ k t k ) Compute G k+1 = F τk+1 τ k+1 F T τ k+1 τ k+1 Parallel transport tangent vector H k to the point τ k+1 set ηg k := G k or 0, which is not parallel. Compute the new search direction ηh k = H k Mt k ) τ k R T Nt k ) H k+1 = G k+1 + β k ηh k where β k = G k+1 ηg k, G k+1 G k, G k and 1, 2 = trace T 1 I 1 2 ττt ) 2 ) Reset H K+1 = G k+1 if k + 1 mod pn p) + pp 1)/2 Although, here we have used CG method, one can use Newton method to find the maxima on Stiefel manifold Update equation for φ n,t The terms from equation 4 35 which are relevant for updating φ n,t are U = E Q [log QV, η, z)] + N E Q [log pz n V )] + E Q [log px n z n )] 4 47) n=1 We need to maximize U w.r.t φ n,t, but note that we have a constraint T t=1 φ n,t = 1. By using the usual Lagrange multiplier technique we have U new = U + κ T t=1 )φ n,t 1) where κ is the Lagrange multiplier. Taking appropriate terms for φ n,t, rearranging, taking derivative of U new w.r.t φ n,t and equating it to zero, we have U new t 1 = 1 + log φ n,t ) + E Q [log1 v i )] + E Q [log v i ] + traces T τ t ) + κ φ n,t 93

94 U new φ n,t = 1 + log φ n,t ) + W t + κ = 0) 4 48) where W t = t 1 E Q[log1 v i )] + E Q [log v i ] + traces T τ t ). From the equation 4 48, we get φ n,t expw t ). Now by using the constraint T t=1 φ n,t = 1, we can easily get the value of the proportionality constant. Thus we can find the update equations for all the variational parameters and by updating them one by one we can maximize the lower bound of the marginal log-likelihood. This method is way faster than MCMC method but one of the potential drawbacks of this method is that it might get stuck in some local maxima as the optimization is typically non-convex Calculated KL-Divergence Let us write down the R.H.S of 4 35 which is the actual computed lower bound. G = E Q [log pv α)] E Q [log Q ν V γ)] }{{} i) N + n=1 E Q [log pz n V )] E Q [log Qz n φ)]) + } {{ } iii) + E Q [log pη λ)] E Q [log Qη τ)] }{{} ii) N n=1 E Q [log px n z n )] } {{ } iv) 4 49) Now clearly, term i), ii) and iii) are -KLQ p) between the distribution of corresponding variables and we can easily calculate them. So using the expressions that we have computer earlier for i), ii), iii) and iv) G becomes: G = [ T 1 [ log Bγ t,1, γ t,2 ) + 1 γ t,1 )ψγ t,1 ) + α γ t,2 )ψγ t,2 ) B1, α) t=1 ] ] 1 + α γ t,1 γ t,2 )ψγ t,1 + γ t,2 ) + [C tracedλ T τ t ER) tracee 2 R) ) ] + + [ [ N N n=1 t=1 n=1 t=1 T φ n,t log φ n,t + N n=1 t=1 T T j=t+1 ) ] φ n,j E Q [log1 v t )] + φ n,t E Q [log v t ] T φ n,t log cg) + traces T τ t ) ) ] 4 50) 94
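Putting the closed-form pieces together, the coordinate-ascent updates of this section can be sketched as follows. This is only our own schematic: the data-dependent term trace(S^T τ_t) of (4-48) is abstracted into an array loglik[n, t] supplied by the caller, the conjugate-gradient update of τ_t on the Stiefel manifold (equation (4-46)) is omitted, equation (4-41) is used for γ, and (4-48), with the usual truncation convention v_T = 1, is used for φ.

import numpy as np
from scipy.special import digamma

def update_gamma(phi, alpha):
    # Equation (4-41): gamma_{t,1} = 1 + sum_n phi_{n,t},
    #                  gamma_{t,2} = alpha + sum_n sum_{j>t} phi_{n,j}
    T = phi.shape[1]
    suffix = np.cumsum(phi[:, ::-1], axis=1)[:, ::-1]   # suffix[:, t] = sum_{j>=t} phi
    g1 = 1.0 + phi[:, :T - 1].sum(axis=0)
    g2 = alpha + suffix[:, 1:].sum(axis=0)
    return np.column_stack([g1, g2])

def update_phi(gamma, loglik):
    # Equation (4-48): phi_{n,t} proportional to exp(W_t), where
    # W_t = sum_{i<t} E_Q[log(1 - v_i)] + E_Q[log v_t] + trace(S^T tau_t);
    # loglik[n, t] stands in for the trace(S^T tau_t) data term.
    e_log_v = digamma(gamma[:, 0]) - digamma(gamma.sum(axis=1))
    e_log_1mv = digamma(gamma[:, 1]) - digamma(gamma.sum(axis=1))
    e_log_v = np.append(e_log_v, 0.0)                   # v_T = 1 under truncation
    cum_1mv = np.concatenate([[0.0], np.cumsum(e_log_1mv)])
    W = cum_1mv[None, :] + e_log_v[None, :] + loglik
    W -= W.max(axis=1, keepdims=True)                   # for numerical stability
    phi = np.exp(W)
    return phi / phi.sum(axis=1, keepdims=True)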

Figure 4-5. Log marginal probability of the data increases with the number of iterations

In Figure 4-5 we can see how G varies with the number of iterations.

4.7 Experiments

In this study, we first validate our theory with a set of simulated experiments on synthetic data. We report our results both on synthetic data sets and on real image data sets. Since the results are based on a nonparametric framework, the number of clusters was unknown a priori. In most of the cases presented here, we were able to recover the unknown number of clusters, which in effect solves the model selection problem of traditional machine learning. Moreover, especially for a large collection of images, it makes sense not to assume the number of clusters before starting the experiment. We provide results showing that our method circumvents this issue by using a nonparametric Bayesian framework.

Experiments on Synthetic Data

For the experiments on simulated data we constructed a total of eight experiments, containing different data sets from the Stiefel manifold V_{n,p} for various values of n and p with p ≤ n. One of the main problems in this synthetic simulation is to generate data on

Table 4-1. Results for the synthetic data sets: V_{n,p} (values of n and p), clustering accuracy of the nonparametric DPM (%), and the estimated number of clusters

the Stiefel manifold. We employ an MCMC sampling algorithm to generate samples on that manifold; for further reference we point the reader to [46], in which the author develops an algorithm for sampling from the matrix Bingham-von Mises-Fisher distribution, one of the most general distributions on the Stiefel manifold. We generated samples from the appropriate Matrix-Langevin distributions. In all the simulated experiments we used 4 classes, with 200 data points per class. In Table 4-1 we report the overall performance of our algorithm, i.e., of clustering based on the Dirichlet process mixture, on all of the simulated data sets. We point out again that the number of clusters was totally unknown and that the optimal number of clusters was selected dynamically based on the data. We could not recover the actual number of clusters in every case; convergence to a local minimum could be one possible reason for this. For example, among the reported 6 clusters, 2 clusters contained only 7 members each (each with weight 0.875%). In the MCMC implementation, a longer burn-in might help overcome this problem. An important observation is that as r increases from one to two, three or four, the overall run-time of the algorithm increases. We also ran these experiments with 3 clusters, and the overall performance was very similar to that reported above.

Categorization of Objects

The object categorization problem refers to the problem of grouping similar objects together in the same class. To evaluate our technique on this problem, we ran the algorithm on a

subset of the ETH-80 data set [62], [63].

Figure 4-6. Confusion matrix for all of the simulated data sets

We used 6 different categories, each containing 80 images: tomato, pear, car, horse, cup and cow. We also evaluated our algorithm on 3 classes, 4 classes, 5 classes and finally 6 classes; as one would expect, the accuracy decreases slightly as the number of categories grows. Feature extraction was one of the important parts of this experiment, and we used two different features. The first was a unit-norm feature vector constructed similarly to the method described in [64]. First, the x-gradient and y-gradient of each image were computed at three different scales, i.e., with three different standard deviations of a Gaussian filter (here 1, 2 and 4), which results in six gradient images for each image in our data set. For each of these, a 32-bin histogram was computed, and the histograms were individually normalized and concatenated. This gave us a 192-length

feature vector, which was then projected onto a low-dimensional subspace such that at least 99% of the variability was captured. In our case this turned out to be a 21-dimensional space onto which all the feature vectors were projected [41]. Finally, these vectors were normalized to generate the feature vectors, which were then clustered using the DP mixture. The other feature that we used in these experiments was the Histogram of Oriented Gradients (HOG) [65]; both features performed similarly. Figure 4-7 shows some objects from our 6-category data set, and Table 4-2 gives the performance accuracy for different numbers of clusters.

Figure 4-7. Selected 6 object categories from the ETH-80 data set

Table 4-2. Actual and estimated number of clusters, and accuracy, for real data with different numbers of clusters (columns: actual number of clusters, accuracy in %, estimated number of clusters)

Classification of Outdoor Scenes

We used a subset of 3 categories from the 8-Scene classification data set, used earlier by [66] in their research. In this case we used HOG to extract features from

the data, as it outperformed the other features. The 3 categories were mountain, coast and tall-building. Figure 4-9 shows a few images from this subset, and Figure 4-10 shows the resulting confusion matrix.

Figure 4-8. Confusion matrix for the ETH-80 data set

Figure 4-9. Selected 3 scene categories from the 8-Scene data set

Figure 4-10. Confusion matrix for the Outdoor Scene data set

Using the DPM clustering method we clustered the images with 90.42% accuracy and, importantly, the algorithm found the 3 clusters correctly. All of the above experiments with synthetic and real data sets show the promise of our approach.
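For completeness, the gradient-histogram feature pipeline used in the object categorization experiments can be sketched roughly as follows. This is our own reconstruction under the settings stated above (scales 1, 2, 4; 32 bins; projection to 21 dimensions); the choice of histogram normalization and the PCA implementation are our assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def gradient_histogram_feature(image, sigmas=(1, 2, 4), bins=32):
    # x- and y-gradients at three Gaussian scales, one 32-bin histogram per
    # gradient image, each histogram normalized, then concatenated (6 x 32 = 192).
    feats = []
    for s in sigmas:
        gx = gaussian_filter1d(image, sigma=s, axis=1, order=1)
        gy = gaussian_filter1d(image, sigma=s, axis=0, order=1)
        for g in (gx, gy):
            h = np.histogram(g, bins=bins)[0].astype(float)
            feats.append(h / (np.linalg.norm(h) + 1e-12))
    return np.concatenate(feats)

def project_and_normalize(features, n_components=21):
    # PCA-style projection to a low-dimensional subspace followed by unit
    # normalization, so that each feature vector lies on a sphere.
    X = features - features.mean(axis=0)
    Vt = np.linalg.svd(X, full_matrices=False)[2]
    Z = X @ Vt[:n_components].T
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)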

101 CHAPTER 5 BETA-DIRICHLET PROCESS AND CATEGORICAL INDIAN BUFFET PROCESS 5.1 Multivariate Liouville Distributions A Dirichlet distribution with parameter ᾱ = {α 1,, α Q } can be represented either on the Simplex S Q c S Q 1 o = {z 1,, z Q ) Q z i = 1} in R Q + or as a distribution inside the simplex = {z 1,, z Q 1 ) Q 1 z i 1} in R Q 1 +. We will refer S Q c and So Q 1 as Closed Simplex and Open Simplex, respectively. So the following two statements are equivalent. ȳ Dir Q ᾱ) on S Q c ȳ Dir Q 1 α 1,, α Q 1 ; α Q ) on S Q 1 o Definition 8. An Q 1 vector x in R Q + is said to have a Liouville distribution if x d = rȳ, where ȳ Dir Q ᾱ) on S Q c. r is an independent random variable with cdf F or pdf f if density exists). It is written as x L Q ᾱ; F ). We will use the following terminologies ȳ is the Dirichlet base with Dirichlet parameter ᾱ. r is the generating variate F and f are generating cdf and generating density, respectively. ) x Note that, x L Q ᾱ; F ) iff 1 x Q,, Q x Q Dir i x Q ᾱ) on S Q c and independent of i Q x i, which is equivalent to r. Now we will state a fact and its proof from [67]. Fact 1. A Liouville distribution L Q ᾱ; F ) has a generating density f iff the distribution has a density of the following Q x α i 1 i Γ Q α i) Γα i ) f x x. ) Q.) where x. = α i 1) Q x i This density is defined in the simplex {x 1,, x Q ) Q x i a} iff f is defined in the interval 0, a). 101

102 Proof. If r has the density f and ȳ Dir Q ᾱ), then the joint density of y 1,, y Q 1 ) and r is given by Γ Q α Q 1 i) Q Γα i) y α i 1 i ) Q 1 αq 1 1 y i f r) The proof of this fact is based of the random variable transformation from x 1,, x Q ) to y 1,, y Q 1, r) where r = Q x i = x. and y j = x j x. j = 1,, Q 1 As this transformation has Jacobian 1 r Q 1, we can easily obtain the fact. Similarly converse of the fact can be shown by the inverse transformation. The domain of the density is clear as well. If the generating variate r distributed as Betaα, β), then the particular form of Liouville distribution is called Beta-Liouville or Beta-Dirichlet distribution. Let us write down few important moments of Liouville distribution. Note that each y j is marginally beta distributed and ȳ is independent of r. Let us take α. = Q α i. Mean is given by Ex j ) = Er y j ) = Er) Ey j ) = Er) α j α. Variance is given by Varx j ) = Varr y j ) = Er 2 )Eyj 2 ) [Er)Ey j )] 2 α j = α α. ) 2. )α j + 1)E[r 2 ] α j α. + 1)E[r]) 2) α. + 1) α j = α α. ) 2. )α j + 1)Var[r] + α. α j )E[r]) 2) α. + 1) Co-variance is given by j < i) Covx j, x i ) = Er 2 )Ey j y i ) [Er)] 2 Ey j )Ey i ) α j α i = α α. ) 2. )Var[r] E[r]) 2) α. + 1) 102

103 For each j < i, the term has the same sign and it will be negative only iff α. )Var[r] < E[r] = Var[r] E[r] < α. ) 1 2 = Coefficient of Variation r) < α. ) 1 2 on S Q o 5.2 Beta-Dirichlet BD) Distribution { As before let S Q o = x [0, 1] Q : } Q x j 1. A random vector X = {X 1,, X Q } is said to follow a BD distribution with parameters α, β, γ 1, γ 2,, γ Q ), if Q X j follows a Beta distribution with parameters α, β) and Y j = X j / Q X j for all j = 1, 2,, Q follows a Dirichlet distribution with parameters γ 1, γ 2,, γ Q ). The probability density is given by Γ Q γ j) Q Γγ j) Q ) α Q Γα + β) γ j x j 1 Γα)Γβ) ) β 1 Q x j Q x γ j 1 j Ix S Q o ) 5 1) where I is the indicator function. A Q + 1)-dimensional Dirichlet distribution with parameters γ 1,, γ Q+1 is a special case of BD distribution with parameters α = Q γ j, β = γ Q+1 and γ j = γ j for all j = 1, 2,, Q. below: 5.3 Normalization Constant by Liouville Extension of Dirichlet Integral Liouville established a more general integral extending the famous Dirichlet integral) as f x x Q ) x 1 + +x Q <h Q x γ i 1 i dx i = Q Γγ i) Γγ γ Q ) h 0 f t) t γ 1+ +γ Q 1) dt Now from the above equation, if we put h = 1 and f t) = t α γ 1+ +γ Q )) 1 t) β 1 then the LHS becomes the density function of BD and the RHS becomes = Q Γγ i) Γγ γ Q ) Q Γγ i) Γγ γ Q ) f t) t γ 1+ +γ Q 1) dt t α γ 1+ +γ Q )) 1 t) β 1 t γ 1+ +γ Q 1) dt 103

104 = = Q Γγ 1 i) t α 1 1 t) β 1 dt Γγ γ Q ) 0 Q Γγ i) Γα)Γβ) Γγ γ Q ) Γα + β) 5 2) and thus we derive the normalization constant for BD. There is a simpler way to derive it by using the above fact. Let us write x. = Q x i. So BD density can be also written as the following Γ Q γ Q 1 i) Q Γγ i) y γ i 1 i ) Q 1 γq 1 1 y i f x. ) where y j is defined as above where generating density is Beta and it is distributed as Γα+β) Γα)Γβ) x.) α 1 1 x. ) β 1. Putting back the values, we have the BD density as, [ Γ Q γ ] j) Q [ ] Q Γγ y j ) γ j 1 Γα + β) j) Γα)Γβ) x.) α 1 1 x. ) β 1 5 3) }{{}}{{} Beta Dirichlet where y Q = 1 ) Q 1 y j. So this density which is defined on S Q c [0, 1] or So Q 1 [0, 1] is equal to the one defined in equation 5 1. From here, we can also see it gives rise to the same normalization constant as 5 2. So it can be generated in the following stages - Generate x. from a beta distribution with parameters α and β. Generate y 1, y 2,, y Q ) from a Dirichlet distribution with parameters γ 1,, γ Q ). Multiply the vector y 1, y 2,, y Q ) with x. to get x 1, x 2,, x Q ) With Multinomial Likelihood 5.4 BD Distribution Conjugacy Let us write down the discrete time BD density function Q-dimension) with parameters α, β and γ 1,, γ Q Πp 1, p 2,, p Q ) Q ) α Q γ j p j 1 ) β 1 Q p j Q p γ j 1 j Ip S Q o ) 104
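The three-stage generation described above translates directly into code; the sketch below (our own helper, NumPy-based) also checks empirically the first-moment formula E(x_j) = E(r) γ_j / γ. from the previous section.

import numpy as np

def sample_beta_dirichlet(alpha, beta, gammas, size=1, rng=None):
    # Stage 1: draw the total x. ~ Beta(alpha, beta);
    # Stage 2: draw y ~ Dirichlet(gamma_1, ..., gamma_Q);
    # Stage 3: return x = x. * y, which lies in the open simplex S_o^Q.
    rng = np.random.default_rng() if rng is None else rng
    total = rng.beta(alpha, beta, size=size)
    y = rng.dirichlet(np.asarray(gammas, dtype=float), size=size)
    return total[:, None] * y

alpha, beta, gammas = 2.0, 3.0, np.array([1.0, 2.0, 0.5])
x = sample_beta_dirichlet(alpha, beta, gammas, size=200000)
print(x.mean(axis=0))
print((alpha / (alpha + beta)) * gammas / gammas.sum())   # should be close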

105 Let X 0, X 1,, X Q follows a Q + 1)-dimensional Multinomial with parameters n and p 0, p 1,, p Q such that Q j=0 p j = 1. Let us denote the number of outcomes in each categories by n j for all j = 0, 1,, Q and n = Q j=0 n j. Its pmf, which is the likelihood, is given by LX 0 = n 0, X 1 = n 1, X Q = n Q ) p n 0 0 = Q 1 p n j j ) n0 Q p j So the posterior distribution of p j for j = 0, 1, Q given the data becomes Q p n j j Π L Π Q ) α+ p j = Q Where updated parameters Q n j ) Q γ j +n j )) ) α Q γ j p j 1 1 ) β 1 Q p j ) β+n0) 1 Q p j Q p γ j j Q p γ j +n j ) 1 j for Dirichlet part) γ j = γ j + n j j = 1, 2, Q Q for Beta part) α = α + n j and β = β + n 0 It is clear that in the discrete case, BD is a conjugate prior for multinomial likelihood With Negative Multinomial Likelihood The Negative Multinomial NM) distribution is a generalization of the negative binomial distribution for more than two outcomes. Suppose an experiment generates Q outcomes, namely {n 0,, n Q }, each with probabilities {p 0,, p Q } respectively. If sampling is done until n observations, then {n 0,, n Q } would be distributed with multinomial distribution. However, if the experiment is stopped once n 0 reaches the predetermined value r 0, then the distribution of the Q-tuple {n 1,, n Q } is NM. 105

106 Let the Negative multinomial is represented by NMr 0, p 1,, p Q ), such that Q p j 1. The distribution function is given by Γr 0 + Q n j) Γr 0 ) Q n 1 j! ) r0 Q p j Q p j n j 5 4) where n j corresponds to p j. Using the BD density defined above we have the posterior as, Q p j α+ Q n j ) Q γ j +n j ) 1 ) β+r0) 1 Q p j Q γ p j +n j ) 1 j Similarly, this is also a BD with updated parameters BDα, β, γ 1,, γ Q )), where the updated parameters are, for Dirichlet part) γ j = γ j + n j j = 1, 2, Q Q for Beta part) α = α + n j and β = β + r 0 So in the discrete case, BD is also conjugate prior for negative multinomial likelihood. 5.5 Completely Random Measure CRM) Representation Consider a probability space Ω, F, P). A random measure µ is such that µa) is a non-negative random variable for any set A F. Now for any disjoint measurable set A, A F, if µa) and µa ) are independent random variables, then µ is called a CRM. Borrowing the terminology from [20], we can see µ is composed of at most three components, µ = µ d + µ f + µ o A deterministic measure denoted by µ d and clearly µ d A) and µ d A ) are independent for disjoint A and A. A finite fixed atoms: let u 1, u 2,, u L ) Ω L be a collection of fixed finite number of locations and let η 1, η 2,, η L ) R L + are the independent random weights for those atoms, respectively. Then µ f = L l=1 η lδ ul. An ordinary component : let ν be a Poisson process intensity on space Ω R +. Let v 1, ξ 1 ), v 2, ξ 2 ), ) be a draw from this process. Then µ o = ξ jδ vj. 106
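To make the ordinary component concrete, the sketch below simulates it for a finite intensity measure ν(dp) H(dω); the Beta-process intensity itself has infinite mass near zero, so this is only an illustration of the mechanism, with our own function names.

import numpy as np

def sample_ordinary_component(total_mass, draw_weight, draw_location, rng=None):
    # Ordinary component of a CRM with finite intensity: the number of atoms is
    # Poisson(total_mass); each atom gets an independent location v_j and weight
    # xi_j from the normalized intensity, giving mu_o = sum_j xi_j * delta_{v_j}.
    rng = np.random.default_rng() if rng is None else rng
    J = rng.poisson(total_mass)
    locations = np.array([draw_location(rng) for _ in range(J)])
    weights = np.array([draw_weight(rng) for _ in range(J)])
    return locations, weights

# Example: intensity 5 * Exponential(1)(d xi) * Uniform[0, 1](d v)
locs, wts = sample_ordinary_component(5.0, lambda r: r.exponential(1.0), lambda r: r.uniform())
print(len(locs), wts)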

107 Through out this article, we have assumed that µ d is identically equal to 0 as it is a non-random component. From [20], we have CRM representation for Beta processbp), Gamma processgp) and Dirichlet processdp) and we can also identify the fixed and ordinary component of those. For example, Beta process is an example of a CRM with a mass parameter γ > 0, a concentration parameter θ > 0, an a.s purely atomic measure H f = L l=1 ρ lδ ul with ρ l [0, 1] for all l and a purely absolutely continuous measure H o on Ω. So the CRM components are: µ d is uniformly 0 Fixed atom locations are u 1,, u L ) Ω L and atom weights η l is distributed as η l ind Betaθγρ l, θ1 γρ l )) Ordinary component has Poisson process intensity H o ν, where ν is the σ-finite measure with finite mean νdp) = γθp 1 1 p) θ 1 dp CRM for BP can be be written as including both continuous and discrete part) B = p k δ ωk L η l δ ul + J ξ j δ vj k=1 l=1 j The atom locations are union of both the atoms {Ω k } k = {u l } l=1 L {v j } J. Clearly BP is almost surely discrete. There is an alternative way to define the Beta Process where it can be written as B BPθ, γ, [u, ρ, σ], H o ). It describes the following CRM with a mass parameter γ > 0, a concentration parameter θ > 0, L number of atoms with locations u 1, u 2, u L ) Ω L, two sets of positive atom weight parameters {ρ l } L l=1 and {σ l} L l=1 and a purely absolutely continuous measure H o on Ω µ d is zero. L fixed atom locations u 1,, u L ) Ω L with corresponding weights η l ind Betaρ l, σ l ). The ordinary component has Poisson process intensity H o ν, where νdp) = γθp 1 1 p) θ 1 dp 107

108 5.5.1 Another Viewpoint for CRM Often times it is very easy to work with CRM representation for these non-parametric priors. So let us look into CRM from a functional of a Poisson random measure viewpoint [68]. Let T be any topological space with a σ-algebra BT ). Let U = Ω, F, P) any probability space and X be a complete, separable metric space. Let us define a Poisson random measure N on S = R + X with mean measure ν. For any set A in BS) such that νa) = ENA)) < and the random variable NA) is distributed as PoissonνA)). For any finite collection of disjoint sets A 1,, A n in BS), NA 1 ),, NA n ) are independent random variables. The following conditions need to be satisfied 1 0 pνdp, X) < and ν[1, ) X) < Now let us denote the space of all bounded finite measure on X by M X, BM X )). Let µ be a random element on U taking values in M X, BM X )) which can be written as µc) = pndp, dx) R + C C BX) So µ is a linear functional of Poisson random measure and it is Kingman s CRM on X. So for any disjoint sets X 1, X 2, in BX), µx 1 ), µx 2 ), are mutually independent and µ i A i ) = i µa i) almost surely-p. If g : X R + is a measurable function then µ can be characterize by the following Laplace functional )) [ E exp gx)µdx) = exp e pgx) 1 ] ) νdp, dx) X S Now, if νdp, dx) = λdp)hdx) for some measure λ ) on R + and H is a non-atomic σ-finite measure on X, then N and µ are called homogeneous. Here we will only consider the homogeneous case. Also note that, we could have defined more general linear functional like S X hp)ndp, dx) 108

109 where S is a separable complete metric space and h : S R +. They are known as h-biased random measures. But we are using a very simple h, which is hp) = p, so it is also called size-biased random measure Campbell s Theorem I would also like to state Campbell s theorem from [17], which is closely related to these concepts. The proof can also be found in [17]. Theorem Let Π be a Poisson process on S with mean measure µ and let f : S R be measurable. Then the sum Σ = X Π is absolutely convergent with probability one iff If this condition holds, then S f X ) min f x), 1)µdx) < { Ee θσ ) = exp e θf x) 1 ) } µdx) S for any complex θ for which the integral on the right converges, and in particular whenever θ is pure imaginary. Moreover EΣ) = S f x)µdx) in the sense that the expectation exists iff the integral converges, and they are equal. If the above equation converges, then VarΣ) = f x) 2 µdx) S finite or infinite. 5.6 Beta Dirichlet Process Recently in [3] Beta-Dirichlet BD) prior process was introduced in the context of analyzing multi-state event history data. It is a conjugate prior for cumulative intensity 109

110 functions of a Markov process based on multivariate non-decreasing process with independent increment which can be considered as an extension of Beta process introduced by Hjort [18] earlier BP construction by taking limit from discrete case So Let us first review the construction of Beta Process by taking the limit from the time-discrete model. We have already seen the definition of Beta Process in chapter 3. BP is a specific example of pure jump subordinator. In [14] reader can get excellent reference of these type of Lévy processes. We know that they can be equivalently defined through Kingman s CRM. In the original construction of Hjort [18], BP was first defined in the time discrete case. After that for time-continuous case he took the limit of the discrete model. Let us take A 0 be continuous and c.) be piecewise continuous, non-negative function. The construction was as follows For each n, let us define independent variables X n,i Betaa n,i, b n,i ) for i = 1, 2,. Where a n,i = c n,i A 0 i 1 n, i n ] b n,i = c n,i 1 A 0 i 1 n, i n ]) and c n,i = c i 1 2 n ) Lets A n 0) = 0 and A n t) = i n t X n,i; Then A n has independent beta increments The number of jumps increases as n The expected jump size becomes smaller as n The sequence {A n } converge in distribution to the correct Beta process in the Skorokhod space D[0, 1] Construction of BDP For discrete time model, Dirichlet distribution is a natural conjugate prior for transition probabilities. But Hjort showed that cumulative intensity functions are independent in the limit see theorem 5.1 in [18]) if we start with Dirichlet model. To be precise let us assume that they are denoted by A 1, A 2,, A Q. If they are independent, then the cardinality of {k : A k t) > 0} is either 0 or 1. Now it is clear that independent Beta processes are 110

111 not conjugate in this case. To remove this problem BD prior was introduced and cumulative intensity functions become dependent in the limit under this new prior. We have also seen the effect of this independence when we tried to extend the IBP in a standard way using Dirichlet distribution. We will briefly describe BD and the properties as mentioned in [3]. Let Ā h = A hj : j = 0, 1,, Q) be the vector of cumulative intensity function s from state h. We know that for j h, A hj t) 0 and j h A hjt) 1. Also the cumulative intensity functions for transitions from state h is given by A h. t) = j h A hjt). Let us take, v hj t) = da hj t) da h. t) which denotes instantaneous conditional transition probability from state h to j. Properties. For given Ā h, we can see that from properties of BD distribution) A h. t) and {v hj t), h j} are completely independent. Note that A h. is a Beta process and {p hj t), h j} follows a Dirichlet distribution given t. Underlying Q-dimensional Lévy process representation of BD process Ā with parameters A 0 ), c ), {γ j )} Q can be written as [ t ) ] Eexp < θ, Ā >)) = exp e < θ, x> 1 νds, d x) 0 [0,1] Q where νdt, d x) = Υ x γ x γ Q 1 Q x.) Q γ j 1 x.) ct) 1 dx 1 dx Q da 0 t) and Q x j = x. and Υ = ct) Γ Q γ j) Q Γγ j) Let us take y j = x j /x. for all j = 1, 2, Q. Now let us write the above Lévy measure in term of these new variables y j. Note that Q y j = Q x j/x. = 1. So we have by using the result from equation 5 3, { Γ Q γ } j) Q { } Q Γγ y j ) γ j 1 dy 1 dy Q ct)x.) 1 1 x.) ct) 1 dx.) da 0 t) j) }{{}}{{}

112 Now we can see clearly from the part 1) of the above equation that y 1,, y Q )) is Dirichlet distributed with parameters γ 1,, γ Q ) while x. is distributed as Beta with parameters ct) and A 0 t) from part 2) of the equation. In [3] paper the authors talked about the sample path of Q-dimensional BD process with parameters c ), A 0 ), γ 1 ),, γ Q ). It can be generated as follows. Let Ā. be a beta process with parameters c ) and A 0 ). Now conditioned on Ā., the BD process Āt) can be constructed as a Q-dimensional Lévy process as: Āt) = V s) Ā. s) s t,s T where, T = {t : Ā.t) > 0} and V s)s T ) are independent Dirichlet random variables with parameters γ 1 s), γ 2 s),, γ Q s)). For simplicity, here we assumed V s) V, where all the parameters of Dirichlet are fixed to γ 1, γ 2,, γ Q ) instead of being function of s. To make it even simpler we will assume that c ) is also a constant equal to c. Mean of this process can be specified for j-th co-ordinate by EA j t)) = t 0 γ j s) k γ ks) da 0s) j = 1,, Q Variance of this process can be specified for j-th co-ordinate by VarA j t)) = t 0 γ j s)γ j s) + 1) k γ ks) 1 k γ ks) + 1) cs) + 1 da 0s) j = 1,, Q Co-variance of this process between j-th and i-th co-ordinates can be specified by CovA j t), A i t)) = t 0 γ j s)γ i s) k γ ks) 1 k γ ks) + 1) cs) + 1 da 0s) j, i = 1,, Q where j i 5.7 Multivariate CRM MCRM) Representation of BDP The definition of CRM can be extended for vector valued CRM [69] or MCRM as well. For example, µ = µ 1,, µ Q ) is a completely random measures such that for A and A the vectors µ 1 A),, µ Q A)) and µ 1 A ),, µ Q A )) are independent. Like one-dimensional case, we can write down the MCRM representation of BDP as follows, µ = µ d + µ f + µ o. The deterministic part µ d is uniformly zero, µ d =

113 The finite fixed atoms: let u 1, u 2,, u L ) Ω L are fixed atoms and let η 1,, η L ) R + ) Q ) L are independent vectors of weights for those atoms. µ f = L η l δ ul = l=1 L η l1, η l2,, η lq )δ ul l=1 The ordinary component: let ν be the Poisson process intensity defined on Ω R + ) Q. Let v 1, ξ 1 ), v 2, ξ 2 ), ) be a draw from this process. Then µ o = ξ i δ vi = ξ i1, ξ i2,, ξ iq )δ vi BD process is an example of a vector valued CRM or MCRM with a mass parameter β > 0, a concentration parameter θ > 0, an a.s purely atomic measure H f = L l=1 ρ lδ ul with ρ l [0, 1] for all l and a purely absolutely continuous measure H o on Ω. So for BD process let us write down the MCRM representation as follows: µ d is uniformly 0. Fixed atom locations are u 1,, u L ) Ω L and independent random weights η l = η l1, η l2,, η lq ) [0, 1] Q for those atoms are distributed as η l ind Beta-Dirichletθβρ l, θ1 βρ l ), γ f 1, γ f 2,, γ fq )) Ordinary component has Poisson process intensity H o ν, where ν is the σ-finite vector-valued [0, 1] Q -valued) measure with finite mean νd x) = Γ Q γ oj) Q Γγ oj) βθ x γ o1 1 1 x γ oq 1 Q s Q γ oj 1 s) θ 1 dx 1 dx Q where s = Q x j 1. So complete MCRM for BDP can be be written as: D = p k δ ωk L η l δ ul + M ξ i δ vi = L η l1,, η lq )δ ul + M ξ i1,, ξ iq )δ vi k=1 l=1 l=1 Alternative version of BD process can be written as : [ ] D BDPθ, γ, u, ρ, σ, {γ fj } Q, {γ oj } Q, H o) 113

114 µ d is 0. L fixed atom locations are u 1,, u L ) ω L with corresponding random weights η l = η l1, η l2,, η lq ) ind Beta-Dirichletρ l, σ l, γ f 1, γ f 2,, γ fq )) Ordinary component has Poisson process intensity H o ν, where ν is the σ-finite vector-valued [0, 1] Q -valued) measure with finite mean νd x) = Γ Q γ oj) Q Γγ oj) βθ x γ o1 1 1 x γ oq 1 Q s Q γ oj 1 s) θ 1 dx 1 dx Q where s = Q x j Beta-Dirichlet process as a Poisson process 1 1 p 2 p 1 1 V 1 0 ω τ 0 ω τ νd p, dω) Figure 5-1. Left side) Poisson process Π on [0, τ] S 2 o with mean measure ν = h µ. The set V contains a Poisson distributed number of atoms with parameter hdω)µd p). Right side) One draw from BD constructed from Π. The first S dimension is the location and other dimensions constitute the weight vector. Note that the open simplex is denoted by S Q o. CRM for BD can be written as k=1 p kδ ωk. We illustrate this Poisson connection in the above Figure 5-1taking Q = 2). As in the figure p 1k + p 2k 1 for all k and that is depicted by the dotted line and the projected value which is under the line p 1 + p 2 = 1. The underlying Poisson process is denoted by Π = {ω, p)}. Poisson process is completely characterized by its mean measure νdω, d p). For any subset V [0, τ] S Q o, the random 114

115 counting measure NV ) is the number of points generated by Π which are inside V, therefore NV ) is Poisson distributed with mean measure νv ). Also if we have pairwise disjoint set V 1 and V 2, then NV 1 ) and NV 2 ) would be independent. So in case of BD process, the mean measure of Poisson process is given by ν BD dω d p) = hdω) µd p) Γ Q = γ j) Γγ 1 ) Γγ Q ) c pγ 1 1 γ 1 p Q 1 Q p.) Q γ j 1 p.) c 1 d p hdω) where p. = Q p j 1 and p denotes that it is a vector-valued random process. The measure ν BD d p dω) is called the Lévy measure of the process and h is also called its base measure. 5.9 A Size-biased Construction for Levy representation of BDP In the construction of CRM, Kingman [30] pointed out that the Ordinary component of any CRM can be decomposed countable number of independent components. µ o = k µ k o Let measure ν and ν k are the compensator of the Lévy process corresponding to µ o and µ k o, respectively. Let us denote the underlying Poisson process by Π and Π k with mean measures ν and ν k, respectively. The above equation leads to the following equations ν = k Π = k ν k Π k Let us now state a very useful theorem called Superposition Theorem in Poisson point process theory. The proof is very easy and can be found in [17]. Theorem Let Π 1, Π 2, be a countable collection of independent Poisson Processes on S. Let Π k has a mean measure ν k. Then the superposition Π = k Πk is again a Poisson process with mean measure ν = k νk. 115
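The superposition theorem is easy to check numerically; the toy sketch below (ours) compares the counts of the union of two independent homogeneous Poisson processes on the unit interval with a single Poisson process of the summed rate.

import numpy as np

rng = np.random.default_rng(0)
reps = 100000
n1 = rng.poisson(2.0, size=reps)   # counts of Pi^1 on [0, 1], rate 2
n2 = rng.poisson(3.0, size=reps)   # counts of Pi^2 on [0, 1], rate 3
union = n1 + n2                    # counts of the superposition Pi = Pi^1 U Pi^2
print(union.mean(), union.var())   # both close to 5 = 2 + 3, as for Poisson(5)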

116 If we take the union of all the underlying Poisson process denoted by Π k, then it will have the Lévy measure ν as a result. Now we have a very simple method to construct µ o by taking union of all the underlying Poisson process denoted by Π k. In CRM framework our Poisson process is defined on the S = [0, 1] Ω and µ k can re written as f ω j ),ω j Π k f ω j )δ ωj Let the Q dimensional Beta-Dirichlet Process is given by Āt) Beta-Dirichlet ct), A 0 t), γ 1 t),, γ Q t)) We can write its Lévy-Khintchine representation as E { exp [ ū T Āt) ]} { } = exp 1 e ūt z )dl t z) [0,1] Q Āt) has underlying Lévy measure L t and we know V B[0,1] Q ) dl t z) = ν[0, t], V). For simplicity, let us assume ct) c and γ j t) γ j for all j = 1,, Q. Let us write down the Lévy measure dl t z) = t s=0 νds, d z) = { t Υ s=0 Q z γ j 1 j ) z.) Q γ j 1 z.) c 1 da 0 s) } dz 1 dz Q where t 0 and z S Q o and Q z j = z. and Υ = c Γ Q γ j) Q Γγ j) Note that Υ does not depend on t. becomes, Let y j = z j z. L = exp = exp { { for j = 1, 2,, Q 1) and let y Q = z.. Now the Laplace transform of Āt) [0,1] Q S Q 1 o t s=0 1 t 0 ) Q e ūt z 1 Υ s=0 ) e ūt ȳ 1 Υ z γ j 1 j Q 1 ) z.) Q γ j 1 z.) c 1 da 0 s) dz 1 dz Q } y γ j 1 j ) ) Q 1 γq 1 1 y j 116

117 y Q ) 1 1 y Q ) c 1 da 0 s) dy 1 dy Q 1 dy Q } where ȳ = y Q y 1,, y Q 1, 1 ) Q 1 y j) 1 y Q = As we know y Q 0, 1), we can write, y Q ) = 1+1 y Q)+1 y Q ) 2 + = 1 y Q ) k as 1 y Q ) 0, 1). Using this identity in expression for L and writing Q 1 y γ j 1 j 1 Q 1 j) y γq 1 = GȳQ 1 ), we have L = exp = = = = { S Q 1 o { exp k=0 { exp k=0 { exp k=0 { exp k=0 1 t 0 S Q 1 o S Q 1 o S Q 1 o [0,1] Q s=0 1 t 0 s=0 1 t 0 k=0 ) ) e ūt ȳ 1 Υ Gȳ Q 1 ) 1 y Q ) k 1 y Q ) c 1 s=0 1 t 0 t s=0 s=0 } Q 1 da 0 s) dy j dy Q ) e ūt ȳ 1 e ūt ȳ 1 da 0 s) Q 1 k=0 Υ Gȳ Q 1 ) 1 y Q ) c+k 1 da 0 s) Q 1 ) Υ c Gȳ Q 1) c y Q ) y Q ) c+k 1 dy j dy Q } e ūt ȳ 1) Υ c Gȳ Q 1) [ ] Q 1 } c c + k da 0s) dy j dy Q ) e ūt z 1 Υ [ c ] c + k da 0s) Q z γ j 1 j ) dz 1 dz Q } dy j dy Q } [c + k) y Q ) y Q ) c+k 1] z.) 1 Q γ j 1 z.) c+k 1 This last equality comes from reversing the original transform. 117

118 Theorem For a Beta-Dirichlet Process Ā BDPcs), A 0 s), γ 1 s),, γ Q s)) with base measure A 0, concentration measure cs) and Dirichlet parameters {γ j s)} Q. Let Π is its underlying Poisson process and ν be its Lévy measure. Then Π and Ā can be written as Π = k Π k and Ā = k Ā k where Ā k is the Lévy process with underlying Poisson process Π k. The Lévy measure ν k is the decomposition of ν such that, ν k ds, d z) = Beta-Dirichlet z ; 1, c + k, γ 1,, γ Q ) d z A k 0ds) A k 0ds) = c c + k A 0ds) and we have νds, d z) = ν k ds, d z) k=0 The above theorem shows that we can write down the underlying Poisson process of original Beta-Dirichlet process is a superposition of countable independent Poisson process Π k with corresponding mean measure ν k. So any BDP can be expressed as a countable sum of independent Lévy processes Ā k with Lévy measure ν k and Π k as the underlying Poisson Process. Note that Ā k is no longer a BDP as it violates the definition of BDP BD-Categorical Process Conjugacy We have seen the definition of BDP. In order to use BDP in machine learning application, we have to couple it with either Categorical Process CaP) or Negative Multinomial Process NMP). Let us first define CaP and NMP with underlying base or hazard measure Ā. In this context, we will focus on the fact that the base measure is discrete and a draw from a BDP. But these processes can be defined in a very general setting as well. We will try to keep our notations similar to the standard notations found in recent machine learning literature for example [20]. 118

5.11 Categorical Process (CaP)

Let Ā be a vector-valued measure defined on Ω. We define a Categorical process X with base measure Ā and write X ∼ CaP(Ā). If Ā is continuous then X is a marked Poisson process. If Ā is discrete, then Ā is of the form Ā = Σ_{k=1}^∞ p̄_k δ_{ω_k} with p̄_k = (p_{k1}, ..., p_{kQ}), and X = Σ_{k=1}^∞ c̄_k δ_{ω_k} with c̄_k = (c_{k0}, c_{k1}, ..., c_{kQ}). Note that when we draw a random vector from a Categorical distribution with parameter p̄_k, we need to augment p̄_k with p_{k0} = 1 − Σ_{j=1}^Q p_{kj}, and the output is a (Q + 1)-dimensional vector in which exactly one entry equals 1. When c_0 = 1, i.e., c̄ = (1, 0, ..., 0), we treat this as a special vector. If Ā has two parts, continuous as well as discrete, then X will also have two parts. This CRM definition of CaP enables easier calculation. Figure 5-2 shows how X is generated. Similarly to [2], we can treat each ω_k ∈ Ω as a latent feature. Previously, for example with the Bernoulli process (BeP) and the Indian Buffet process (IBP), we only cared about the existence or non-existence of a feature; now each feature comes with a choice, or category, and thus we obtain a more general process such as the CaP, or the categorical IBP discussed later in this chapter. We will also show that the BDP is the De Finetti mixing distribution for the cIBP, just as the CRP is for the DP and the BP is for the IBP.

Conjugacy for CaP and BDP

CRM Formulation. Let us write down the full CRM specification for the BDP and the CaP where we have only the discrete part. The CaP will be denoted CaP(Ā). It has a single parameter, the base or hazard measure Ā, which is discrete and has atoms taking values in S_o^Q. We focus particularly on the model where Ā is drawn from a BD process, whose draw is almost surely discrete. The entire model is the following:

X ∼ CaP(Ā)
Ā ∼ BDP((θ, β, γ_1, ..., γ_Q), G)
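To illustrate how X | Ā is generated, the sketch below replaces the (infinite-atom) BDP draw with a crude K-atom stand-in whose weights are i.i.d. Beta-Dirichlet vectors; it is not a faithful BDP sampler, and all names are our own.

import numpy as np

rng = np.random.default_rng(1)

def finite_atom_bdp_stand_in(K, alpha, beta, gammas):
    # K atoms at uniform locations with i.i.d. Beta-Dirichlet weight vectors;
    # a real BDP draw has countably many atoms governed by its Levy measure.
    locations = rng.uniform(size=K)
    totals = rng.beta(alpha, beta, size=K)
    weights = totals[:, None] * rng.dirichlet(gammas, size=K)   # rows lie in S_o^Q
    return locations, weights

def draw_cap(weights):
    # X | A: for each atom append p_0 = 1 - sum_j p_j and draw one categorical
    # indicator; category 0 corresponds to the special vector (1, 0, ..., 0).
    p0 = 1.0 - weights.sum(axis=1, keepdims=True)
    probs = np.hstack([p0, weights])
    return np.array([rng.choice(probs.shape[1], p=row) for row in probs])

locations, weights = finite_atom_bdp_stand_in(K=20, alpha=1.0, beta=2.0, gammas=np.ones(3))
print(draw_cap(weights))   # one categorical label in {0, 1, 2, 3} per atom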

120 1 0 c i c i c i0 0 X CaPĀ) Ω {0,1,2} p i2 p i1 appending p i0 p i0 = 1 p 1i +p i2 )) p i2 p i1 Ā BDPθ,β,γ 1,,γ Q ),G f,g o ) fixed points random points Ω S 2 o Figure 5-2. BD-Categorical process with Q = 2 where θ is the concentration parameter, β is the mass parameter, γ 1,, γ Q ) is the Dirichlet parameter and G is the base measure such that GΩ) = β. Here G is assumed to be absolutely continuous. Note that a Categorical distribution is a special case of multinomial distribution with total outcome is equal to 1. Also we know that it is the generalization of the Bernoulli distribution for a categorical random variable. In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are not important. We will use {0, 1,, Q} for convenience. In this case, the probability mass function f is f x = j p) = p j, 120

121 where p j represents the probability of seeing element j and Q j=0 p j = 1. Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket f x p) = where [x = j] evaluates to 1 if x = j, 0 otherwise. Q j=0 p [x=j] j, BDP Ā is now coupled with a CaP, denoted by CaPĀ). It has a single parameter, the base measure Ā, which is discrete and has atoms that takes values in [0, 1] Q. We will focus particularly on the model where Ā is drawn from a BD process whose draw is almost surely discrete. We write Ā = p k δ ωk = k=1 p k1, p k2,, p kq ) T δ ωk Note that, Q p kj 1 for all k = 1, 2,. Let us take p k0 = k=1 1 ) Q p kj. Now {p kj } Q j=0 can be treated as a prior probability distribution for a categorical distribution with Q + 1) components. We say that the random measure X is drawn from a Categorical process, X CaPĀ), if X = c k δ ωk = k=1 c k0, c k1,, c kq ) c k0, c k1,, c kq ) T δ ωk where k=1 ind Categoricalp k0, p k1,, p kq ) for k = 1, 2,. Note that only one entry of c k0, c k1,, c kq ) is 1 and all other entries are 0s for all k = 1, 2,. So The categorical process lies on the space Ω {0, 1,, Q} as shown in figure 5-2. Here we assumed that the base measure has both discrete and continuous part. Now suppose BDP Ā has the parameters θ > 0, β > 0, {γ f,j } Q, {γ o,j} Q and base measure G. We assume G is any arbitrary CRM with three parts G = G d + G f + G o according to [30]. In all proofs we have assumed G d = 0. So we will take G = G f + G o. In later section, we have 121

122 assumed that G itself is a draw from a BP G = k b kδ ωk in the hierarchical model. Ā BDPθ, β, {γ f,j } Q, {γ o,j} Q, G f, G o ) and X CaPĀ). We refer the overall process as the Beta-Dirichlet Categorical process BDCaP) BDCaP Conjugacy With Standard Parametrization Theorem Let G be a measure with fixed atomic component G f = L l=1 η lδ ul and absolutely continuous component G o. Let θ > 0 and β > 0. Consider N conditionallyindependent draws from CaP as follows X n = with L c n,lδ f ul + l=1 M c o n,iδ vi i.i.d. CaPĀ), for n = 1, 2,, N ) Ā BDP θ, β, {γ f,j } Q, {γ o,j} Q, G f, G o That is, the CaP draw has M atoms that are not located at the atoms of G f. Then Ā X 1,, X N ) BDP θ post, β post, {γ post f,j } Q, {γpost o,j } Q, G post f ), Go post with θ post = θ + N, β post = β θ θ+n and. Also G post o = G o and where, and G post f = L l=1 η post l δ ul + η post l = η l + θβ) 1 ξ post i = θβ) 1 N M N n=1 q=1 Q n=1 q=1 ξ post i δ vi Q c o,n,i,q c f,n,l,q l = 1,, L i = 1,, M Also we have, γ post f,l,j = γ fj + γ post o,i,j = γ oj + N n=1 N n=1 c f,n,l,j c o,n,i,j l = 1,, L and j = 1, 2,, Q i = 1,, M and j = 1, 2,, Q 122

123 Note that, the second sum Q q=1 is not really a sum because we have exactly one q = q say) for which c,n,l,q = 1 and rest are all zeros for all q such that q {1, 2,, Q}) q q )). Proof. This can be easily verified from Theorem 3 of Kim in [3]. Remark:. c,n,l,q is the q-th component of the vector c,n,l and we can also find out the predictive distribution by integrating out the underlying BDP. We end up with the Categorical Indian Buffet Process cibp) which is the direct analogue of the multidimensional version of the work in [2]. We have shown cibp construction in later section BDCaP Conjugacy Using Alternative Parametrization for BDP in the Base Measure G) Theorem Assume the conditions for the above theorem and consider N conditionallyindependent draws from CaP: X n = with L c f,n,l δ ul + l=1 Ā BDP M c o,n,i δ vi i.i.d. CaPĀ), for n = 1, 2,, N [ ] ) θ, β, {γ oj } Q, u, ρ, σ, {γ fj } Q, G o where all these boldface symbol for L corresponding atoms. In this alternative definition of BDP the base measure is characterized by the tuple G G f, G o ) [u, ρ, σ, {γ fj } Q ], G o). {ρ l } L l=1 and {σ l} L l=1 are all > 0. {γ fj} Q are the Dirichlet parameters corresponding to fixed part and are assumed to be same for all fixed atoms. Similarly {γ oj } Q are the Dirichlet parameters corresponding to random part and are assumed to be same for all random atoms. Then [ ] ) Ā X 1,, X N BDP θ post, β post, {γ post oj } Q, u post, ρ post, σ post, {γ post fj } Q, G post o with θ post = θ + N, β post = β θ θ+n and G post o {u post l } = {u l } L l=1 {v i} M. ρpost, σ post will be such that for l = 1, 2,, L ρ post l = ρ l + N = G o and total L + M fixed atoms, n=1 q=1 Q c f,n,l,q and 123

124 σ post l = σ l + N N Q n=1 q=1 c f,n,l,q and for i = 1, 2,, M ρ post L+i = N n=1 q=1 σ post L+i = θ + N Q c o,n,i,q and N Q n=1 q=1 c o,n,i,q Also we have, γ post f,l,j = γ fj + γ post o,i,j = γ oj + N n=1 N n=1 c f,n,l,j c o,n,i,j l = 1,, L and j = 1, 2,, Q i = 1,, M and j = 1, 2,, Q Proof. This is immediate from previous Theorem BDCaP - Conjugacy - Proof Statement Theorem Let Ā prior = k=1 p kδ ωk be a discrete, CRM on [0, 1] Q with atom locations in [0, 1]. Let p. = Q p j. Suppose it has the following components. There is no deterministic component. The ordinary component is generated from a Poisson point process with intensity ν c dω, d p) = ν c d p)dω such that ν c is absolutely continuous and ν c [0, 1] Q ) <. In particular, the Q-dimensional weights are the p axes and the atom locations are in the ω axis. There are L fixed atoms at locations u 1, u 2,, u L [0, 1]. The Q-dimensional weight of the l-th fixed atom is a random variable with distribution dg l. Draw one CaP X with input measure Ā prior. Let us assume there is only one non-zero atom of X and let { c 1, s 1 } be the pair of observed data with corresponding location. Non-zero atom means that the categorical data vector has a 1 which can appear in any of the last Q positions 124

125 only, so c 10 can not be 1. Note that c = {c 10, c 11,, c 1Q } is Q + 1)-dimensional vector with one 1 in any position j = 1,, Q. The posterior process given the CaP X is a CRM Ā post with the following components. There is no deterministic component. The ordinary component is generated from a Poisson point process with intensity There are three types of fixed atoms. 1 p.)ν c dω, d p) = 1 p.)ν c d p) dω 1. There is old, repeated fixed atom. If u l = s 1, then there is a fixed atom at u l with weight density 1 Q p I [c 1j =1] j dg l p) where W rf W rf is the normalizing constant W rf = Q p [0,1] Q p I [c 1j =1] j dg l p) 2. There are old, unrepeated fixed atoms. If u l s 1, then there is a fixed atom at u l with weight density 1 W uf 1 p. ) dg l p) where W uf is the normalizing constant W uf = 1 p. ) dg l p) p [0,1] Q 3. There is a new fixed atom. If s 1 / {u 1,, u L }, then there is a fixed atom at s 1 with weight density 1 Q p I [c 1j =1] j ν c d p) W new where W new is the normalizing constant W new = Q p [0,1] Q Proof. This can be also verified from Theorem 3 of Kim in [3]. p I [c 1j =1] j ν c d p) 125

126 5.13 Extension of Indian Buffet Process Latent feature model has been extensively used in various application of machine learning. In non-parametric Bayesian set up often the number of features is unknown. One of the important works in this direction can be found in here [11] where a stochastic process named IBP is defined. It is a prior for generating infinite binary matrices. In a metaphorical way it is process of tasting dishes in an infinite buffet by N number of customers. Let Z i be the binary vector where Z ik = 1 if customerobject) i tastes the dishfeature) k. Customer i tastes the dish k with probability m k i, which is the indicator of the popularity of the dish so far. Having sampled the previously sampled dishes, i-th customer tries new N set of dishes, where N is drawn from a Poisson α ) distribution. The exchangeable distribution of IBP can be obtained i which is invariant to the permutation of the columns. By the De Finetti theorem, for any exchangeable distribution there should be an underlying random measure such that conditioned on that the samples becomes conditionally independent given that random measure. For IBP, it was shown that BP is the underlying De Finetti measure. It would be very natural to extend this model to a categorical setting, where each entry of the matrix is not necessarily be 0 or 1, rather a set of integers in {0, 1,, Q} where Q is fixed a priori. Now we will define a prior on this infinite Q + 1)-nary random matrix. We call this stochastic process Categorical IBP or cibp. Categorical IBP. cibp is a direct generalization of IBP as we have discussed above. The main difference of this type of matrix is instead of binary entries we will have a any one of the Q + 1) categories denoted by the integers 1, 2,, Q. Let us we can describe cibp in the light of metaphorical language. The setting would be exactly same with N number of customers in an infinite buffet restaurant. Now assume each dish comes with a choice. For example, we can assume for the moment that Q = 3 and 1, 2 and 3 denotes the spice level - mild1), medium2) and hot3) of a particular dish. Now as the i-th customers walks in, he/she chooses the dish k with a particular spice level with probability m k. ) β kj +m kj i β +m k. ) where m k. = Q m kj and β = Q β j. m k. denotes the number of customers who have tasted the 126

127 dish k in total and m kj denotes the number of customers who have tasted the dish k with a specific spice level. Finally customer i tastes new dishes with spice level j determined by a draw from Poisson β j ) α ). This process - cibp can also be shown to generate an exchangeable β i distribution over Q + 1)-nary random matrices and we will explicitly found the underlying De Finetti measure of this exchangeable distribution - which happens to be the BD process in our case Extension of Finite Feature Model and the Limiting Case Now if one looks at the finite IBP model the natural prior to extend this model to a categorical version would be Dirichlet. Here we have a feature whose value can lie in the set {0, 1, 2,, Q} - so total Q + 1) choice. Treat this 0-th feature as possession of no feature. Rest of them {1, 2,, Q} can be seen as a choice for any particular feature. Q is fixed before hand. One typical matrix Z might look like the figure 5-3. Let the number of rows be N and number of columns be K. Consider each row as an object and each column as a feature. We will assume each object possesses q categories of feature k with a probability π kq. We will later derive expression when K. We will assume that the probability of matrix Z given Figure 5-3. A candidate matrix with Q = 3 has 4 categories namely 0, 1, 2 or

128 π = { π 1, π 2,, π K } where each π k is a vector that looks like {π k1, π k2,, π kq } such that Q π kj 1 for all k = 1, 2,, K. Also let π k0 denotes the probability of generating 0-th category of feature or no feature. Note that each z ik is now a Categorical/Discrete variable taking value in the set {0, 1,, Q}. Now let us compute the probability of generating such matrix pz π) = K k=1 N pz ik π k ) = K k=1 π m k1 k1 πm k2 k2 πm kq kq 1 π k ) m k0 where π k = Q π kj and m kj denotes the number of objects from total N objects possessing category j of feature k i.e m kj = N Iz ik = j). Let us define a Q + 1)-dimensional Dirichlet prior on π k with parameters { α, α,, α, 1} [there are total Q numbers of α ]. So clearly, K K K K p π k ) = Γ Qr + 1) [ Q Γr) ] Γs) π r 1 k1 π r 1 k2 π r 1 kq 1 π k ) s 1 where r = α K and s = 1. So the model can be specified as follows z ik π k Discrete π k ) π k Dirichlet α K, α K,, α K, 1) Note that, each z ik is independent of all other assignment and π k are generated independently. So pz) = = = K k=1 [0,1] Q) N ) pz ik π k ) p π k ) d π k K Γm k1 + α ) Γm K kq + α ) ΓN m K k + 1) k=1 K k=1 ΓN Qα K ) Γ Qα K + 1) [ Q Γm kj + α K ) ] ΓN m k + 1) ΓN Qα K ) Γ α K )) Q Γ Qα K + 1) Γ α K )) Q where m k = Q m kj. It is an exchangeable distribution because it depends only on count m kj. Expectation of the number of non-zero categories for an entry for a single column) by collapsing Q + 1) dimensional Dirichlet to a Beta variable considering only two possibilities - 128

129 feature category 0 and non-zero) K N π k ) pπ k )dπ k = N K Qα K 1 + Qα K NQα as K ) So Expectation of non-zero entry is bounded by NQα) as K. Equivalence Class. Left ordered Q + 1)-nary matrix or lof Z) is obtained by ordering the columns of Z from left to right by the magnitude of the Q + 1)-nary number i.e represented in base Q + 1)) expressed by that column taking first row as the most significant bit. The full History of feature k is referred by the decimal equivalent to the Q + 1)-nary number represented by the vector z 1k, z 2k, z Nk ). K h = number of features having history h K 0 = number of features for which m k = 0 K + = Q+1) N 1 h=1 = number of features for which m k > 0 So we have K = K 0 + K +. Let us denote the equivalent class by [Z]. By the lof ) notion we K! can see exactly K 0! K 1. mapped to the same left-ordered matrix. So now we derive! K Q+1) N 1! p[z]). p[z]) = ) K! K pz) = Q+1) N 1 Z [Z] h=0 K h! k=1 Γ Qα K + 1) [ Q Γm kj + α K ) ] ΓN m k + 1) ΓN Qα K ) Γ α K )) Q 5 5) So let us first derive the equation for pz). Note that, if number of objects having feature k is 0, then m k = 0 and all m kj = 0, which is denoted by K 0. K k=1 Γ Qα K + 1) [ Q Γm kj + α K ) ] ΓN m k + 1) ΓN Qα K ) Γ α K )) Q 129

130 = = = = = = = = Γ Qα + 1) Γ α )) Q K K ΓN + 1) Γ α )) Q K ΓN Qα ) K Γ Qα K Γ Qα K K + k=1 K + k=1 ) K K+ [ Γ Qα + 1) Q ] K Γm kj + α ) ΓN m K k + 1) Γ α + 1) ΓN + 1) ΓN Qα K ) + 1) ΓN + 1) ΓN Qα K ) )) Q K ΓN Qα ) K K+ K+ k=1 K ) Γ Qα K + 1) [ Q Γm kj + α K ) ] ΓN m k + 1) ) K ΓN Qα K )) K + Γ Qα K + 1)) K +ΓN + 1)) K + [ Γ Qα + 1) Q ] K Γm kj + α ) ΓN m K k + 1) Γ α Γ Qα K ΓN Qα N! ΓN+1+ Qα + 1) ΓN + 1) K ) Γ Qα K +1) N! N+ Qα K )! Qα K )! N! N+ Qα K )! Qα K )! K K K K ) K + k=1 K + k=1 N! N ) j + Qα K ) K [ Q )) Q K ΓN Qα ) K K+ K + k=1 k=1 N m k )! Γ α K )) Q ΓN Qα K ) ] Γm kj + α ) ΓN m K k + 1) Γ α )) Q K ΓN + 1) [ Q N! ] Γm kj + α K ) Γ α K ) [ Q ] Γm N m kj + k )! α K ) α Γ α K ) K α K N! N m k )! N! ) K α ) K QK+ + K [ α ) Q ) ]) Q mkj 1 + α K! ) K! k=1 N m k )! N! α K [ Q l=1 m kl 1 j + α ) ]) K Now from equation 5 5 we have, p[z]) = ) ) K K! N! α ) QK+ Q+1) N 1 N ) h=0 K h! j + Qα K K [ K + N m k )! Q m kl 1 j + α ) ]) N! K k=1 l=1 130

131 = ) ) ) K α QK+ K! N! Q+1) N 1 h=0 K h! K 0!K QK+ N ) }{{}}{{} j + Qα K }{{} [ K + N m k )! Q m kl 1 j + α ) ]) N! K k=1 l=1 }{{} 4 Now from 3) we have, Now, we know lim K where, H N = N N ) K N! N Qα j + ) = K N Qα ) j K ) K 1 1+ = exp{ x}. Using this we have x K { exp Qα 1 } = exp { QαH N } as K. j 1. From 4) we have, j K m kl 1 j + α ) K = m kl 1)! + α [other terms] K m kl 1)! as K So 4) becomes, K + k=1 N m k )! Q For 2), first we note that K 0 = K K + and ) l=1 m kl 1)! N! K! = K 0 + K + )! = K 0! K 0 + 1)K 0 + 2) K 0 + K + )) Using this we have, ) K! = K 0! K 0!K QK + = K 0! K K + + 1) K K + + K + )) = K 0! [ K+ ] K j + 1) K 0!K QK + = K+ K+ ) K j + 1) ) K j + 1) 0 when Q > 1 K QK + 131

132 This shows that probability of this matrix generation under Dirichlet prior goes to 0. This signifies that these will be all independent Beta processes as stated by Hjort in his paper. Note that, One of the important characteristics of this matrix is that it has more than 2 features in a column BD processbdp) and Categorical Indian Buffer Process cibp) The BD process was defined by Kim et al. [3] in the context of multistate data event history data from survival analysis. This new prior has been applied by them in a Bayesian semi-parametric regression model related to credit history data recently. This BD can be consider as an extension of Hjort s Beta process [18]. One of the important thing to note that in discrete time model, one can start with a natural conjugate prior for categorical data - which is Dirichlet. But if we take the limit then the cumulative intensity function becomes independent in limit theorem 5.1 in Hjort). To eliminate this property Kim et al. process this novel prior which has a desirable property that cumulative intensity functions are dependent in limit. Let us take BD distribution with parameters α 1, α 2, β 1, β 2,, β Q )). So instead of traditional Dirichlet we will take BD with those specific parameters. Our goal is to generate a matrix which would look like the following First let us write down the density function of Figure 5-4. A candidate matrix with Q = 2 has 3 categories namely 0, 1,

133 Q-dimensional BD, which is given by: Γβ ) Q Γβ i) where β = Q β i and x. = Q x i Symmetric Dirichlet Γα 1 + α 2 ) Γα 1 )Γα 2 ) x.)α 1 1 x.) α 2 In this case we will choose a symmetric Dirichlet with all parameters are equal to β Q x β i i and α 1 = α K and α 2 = 1. So the x. will follow a Beta density with parameters α, 1) and K,, x Q } will follow a Dirichlet with {β,, β} So the full generative model of this matrix x. x. { x 1 is z ik π k Discrete π k ) π k Beta-Dirichlet α K, 1, β,, β) ) Now we will derive the expression for finite K first and then we will obtain the expression when K. Let us denote Q m kj = m k. and also note that in order to m k. to be 0 all the m kj has to be 0. Using the same technique as before we have [ Q ] K ΓQβ)Γ α K pz) = + 1) Γβ + m kj) ΓN m k. + 1) Γ α + m K k.) Γ α k=1 K )Γβ))Q Γ α + N + 1) ΓQβ + m K k.) ΓN + 1) Γ α K = + 1) ) K K+ Γ α + N + 1) K K + ΓQβ) ) [ ] α Q K Γβ + m kj) ΓN m k. + 1) Γ α + m K k.) Γβ)) Q Γ α + N + 1) ΓQβ + m k=1 K k.) ΓN + 1) Γ α K = + 1) ) K Γ α + N + 1) ) K+ K Γ α + N + 1) ΓN + 1) Γ α + 1) K K K + ΓQβ) ) [ ] α Q K Γβ + m kj) ΓN m k. + 1) Γ α + m K k.) Γβ)) Q Γ α + N + 1) ΓQβ + m k=1 K k.) N! Γ α K = + 1) ) K Γ α + N + 1) K 133

134 = = = K + k=1 ΓQβ) ) [ ] α Q K Γβ + m kj) ΓN m k. + 1) Γ α + m K k.) Γ α + N + 1) K N! α )! ) K K α α + N)! K K K + k=1 ΓQβ) ΓQβ + m k. ) Γβ)) Q Γ α K + N + 1) Γ α K + 1) ΓN + 1) ΓQβ + m k.) ) K+ N! α )! ) K K α ) K+ α + N)! K K K + k=1 ΓN m k. + 1) ΓN + 1) N! α )! ) K K α ) K+ 1 α + N)! K Q K K + k=1 ΓN m k. + 1) ΓN + 1) Γ α + m K k.) ΓN m k. + 1) Γ α + 1) ΓN + 1) K Γ α + m K k.) Qβ Γ α + 1) Qβ K ) K+ [ Q ΓQβ) ΓQβ + m k. ) Γ α K + m k.) Γ α K + 1) Qβ)! Qβ + m k. 1)! ] Γβ + m kj ) Γβ) [ Q [ 1 β ] Γβ + m kj ) Γβ) Q ] Γβ + m kj ) Γβ) Continuing in this manner we have, = = ) K 1 α ) ) K+ K+ 1 N j + α ) K Q K K + k=1 K + k=1 N m k. )! N! 1 N j + α K ) ) K α K N m k. )! N! α K + m k. 1)! α K )! Qβ)! Qβ + m k. 1)! [ mk. 1 ) K+ 1 Q ) K+ [ 1 β j + α ) ] [ 1 K mk. 1 j + Qβ) 1 β Q Q ] Γβ + m kj ) Γβ) ] Γβ + m kj ) Γβ) Lets us now take K, then we have p[z]) = lim K K + k=1 K! K 0! Q+1) N 1 h=1 K h! [ mk. 1 N m k. )! N! ) j + α K 1 N j + α K ) ) K α K ) ] [ 1 mk. 1 j + Qβ) ) K+ 1 Q 1 β Q ) K+ ] Γβ + m kj ) Γβ) 134

135 ) K! α Q = lim )K + K K 0!K K+ Q+1) N 1 K + k=1 N m k. )! N! [ mk. 1 ) ) K 1 N h=1 K h! j + α ) K j + α ) ] [ 1 1 Q K mk. 1 j + Qβ) β ] Γβ + m kj ) Γβ) Now we know K! K 0!K 1 as K K+ ) K 1 N j + α ) exp αh N ) as K K [ mk. 1 So in the limit it becomes, ) α Q )K + Q+1) N 1 h=1 K h! j + α K ) ] m k. 1)! as K { K + [N ] mk. )!m k. 1)! exp αh N ) N! k=1 [ ] } 1 1 Q Γβ + m kj ) mk. 1 j + Qβ) β Γβ) Note: The one β in the denominator is to compensate for the fact that the numerator corresponding to the first row of a column starts contributing from β +1); so in order to make it from β, we need this constant. It will be We now give one example for the first column of the given matrix Q = 2) shown in 5-4. Thus we can verify the equation. 1 2 β + 1 2β β 2β β + 2 2β

136 Asymmetric Dirichlet Let Q different parameters for Dirichlet be {β 1, β 2,, β Q }. The the only factor that will change is the following: ΓQβ) ΓQβ + m k. ) 1 β Q Γβ + m kj ) Γβ) and it will change to Γβ ) Γβ + m k. ) 1 β fk Q Γβ j + m kj ) Γβ j ) where β = β 1 + β β Q ) and β fk can be any of β j s. β fk corresponds to the Dirichlet parameter the first symbol in {1, 2,, Q} that got generated in the k-th column. Now multiplying with a factor β f k β and among them K f ) i Now each K i = Q f =1 K f ) i appropriately and assume that i-th customer generates K i dish are number of dishes with choice f. So previous we had K + = N K i.. So K + = N Q f =1 K f ) i total new dish generated with choice f. We have for finite K ) K 1 α ) K+ Q N j + α ) K K f =1 [ 1 1 mk. 1 j + β ) βf β β fk ) ) K f K+ + Q k=1 = Q f =1 K f ) +, where K f ) + number of { N m k. )! N! Γβ j + m kj ) Γβ j ) ] } [ mk. 1 So now the limiting distribution became from, ) { α Q )K + K + [N ] mk. )!m k. 1)! exp αh Q+1) N N ) 1 h=1 K h! N! k=1 [ ] } 1 1 Q Γβ + m kj ) mk. 1 j + Qβ) β Γβ) to this new distribution p[z]) = Q α)k+ f =1 β f β ) K Q+1) N 1 h=1 K h! f ) + exp αh N ) j + α ) ] K 136

137 { [N mk. )!m k. 1)! K + k=1 N! ] [ 1 mk. 1 j + β ) 1 β fk Q ]} Γβ j + m kj ) Γβ j ) Now let us rewrite this expression in a better form. So in the limit the probability of generating an matrix is [ Q αβ f β ) K f ) + ) ) αβf N K exp H f ) β N f =1 K + f ) k=1 i! { [N mk. )!m k. 1)! N! ] [ 1 mk. 1 j + β ) 1 β fk Q ]} ] Γβ j + m kj ) Γβ j ) where K f ) i customer. is the number of new dishes with choice f {1, 2,, Q} sampled by i-th Connection Now, let us write down the metaphorical quantities related to cibp- Let z ik = 0 if customer i does not taste dish k and z ik = j if customer i taste dish k with choice j Customer i tastes dish k with choice j with probability mk. ) ) βj + m kj i β + m k. where m kj is the previous number of customers tasted dish k with choice j and m k. is the previous number of customers tasted dish k with any choice From here we know that customer i does not taste dish k with probability Q ) m kj 1 = 1 m k. i i ) Customer i draws new dishes with choice j from Poi β j ) α totaling Poiα) new dishes. β i The De Finetti mixing distribution behind categorical IBP is Beta Dirichlet BD) process. When we say X i CategoricalD), we mean that X i is a infinite vector with one entry being any integer between 0 and Q. 137

138 Lemma Let D BDc, D 0, β 1,, β Q )) and let X i D CaPD) for i = 1, 2, n be n independent Categorical process draws from D. The posterior distribution of D after observing {X i } n is still a BD process with the following parameters D {X i } n c BD c + n, c + n D n n X i., β 1 + X i1,, β Q + c + n )) n X iq Lemma Combining the previous Lemma and using the fact that PX n+1 {X i } n ) = E D {X i } n PX n+1 D) we have the following predictive distribution formula where each a i = X n+1 {X i } n CaP a 1, a 2,, a Q, 1 a ) where a = c c+n ) β i β )D 0 + j m n.j c+n β i +m nij β +m n.j )δ wj ) for all i = 1, 2,, Q Here m n,i,j is the number of customers among X 1,, X n having tried dish w j with category i and m n,.,j is the number of customers among X 1,, X n having tried dish w j with any category. If D 0 Ω) = γ, then drawing new dish with choice j generates Poi c c+n ) β j γ β ) new dishes of that kind with that particular choice BD-NM Conjugacy Negative Multinomial Process NMP) We have seen that BD distribution is conjugate to Negative Multinomial NM) distribution. We can similarly define a NM Process Ȳ with a underlying base measure Ā = k=1 p kδω k and with another parameter r 0. Like before Ā can be continuous or discrete or mixture of both. So we will denote Ȳ NMPr 0, Ā). In NM, the 0-th position is a very special position so p can be used to draw a Q-dimensional process i.e. Ȳ = k=1 m kδ ωk where m k = m k0, m k1,, m kq ). In CRM notation we can write, Q a i Ȳ NMPr 0, Ā) = Ȳ = m k δ ωk k=1 where m k ind NMr 0, p) 138

139 As we can see this process will be appropriate where we not only need just existence or non-existence of the latent features but also need the count of them. In the later section we will explicitly show the Conjugacy of NMP and BDP for both the cases when the underlying Poisson measure νdω, d p) is finite and νdω, d p) goes to. Also we will give an alternative construction for NMP via marked Poisson process Conjugacy for NMP and BDP CRM Formulation m u 11 m v 11 m v 21 m u 21 m v 31 m u 12 m v 12 m v 22 m u 22 m v 32 Ȳ NMPĀ) Ω Z + ) 2 η 11 η 12 ξ 11 ξ 21 η 21 ξ 12 ξ 22 η 22 ξ 31 ξ 32 Ā BDPθ,β,{γ j } Q,G f,g o ) u 1 v 1 v 2 u 2 v 3 Ω S 2 o Figure 5-5. BD-Negative Multinomial process with Q = 2 Let us write down the full CRM specification for BDP and NMP where we will have only the discrete part. NMP will be denoted by NMPr 0, Ā). It has a two parameters, r 0 and the base or hazard measure Ā, which is discrete and has atoms that takes values in S Q o. We will focus particularly on the model where Ā is drawn from a BD process whose draw is almost surely discrete. The entire model is the following Ȳ NMPr 0, Ā) Ā BDPθ, β, γ 1,, γ Q ), G) where θ is the concentration parameter, β is the mass parameter, γ 1,, γ Q ) is the Dirichlet parameter and G is the base measure such that GΩ) = β. 139

140 The base measure Ā is discrete and has atoms that takes values in [0, 1] Q. We will focus particularly on the model where Ā is drawn from a BD process whose draw is almost surely discrete. Hence we write, Ā = p k δ ωk = k=1 p k1, p k2,, p kq ) T δ ωk k=1 Note that, Q p kj 1 for all k = 1, 2,. Now {p kj } Q can be treated as a prior probability distribution for a negative multinomial distribution with Q components. We say that the random measure Ȳ is drawn from a NM process if Ȳ = m k δ ωk = k=1 m k1,, m kq ) m k1,, m kq ) T δ ωk where k=1 ind Negative Multinomialp k1,, p kq ) for k = 1, 2,. So The NMP lies on the space Ω Z + ) Q as shown in figure 5-5. Now suppose BDP Ā has the parameters θ > 0, β > 0, γ 1,, γ Q ) and base measure G. We assume G is any arbitrary CRM with three parts G = G d + G f + G o according to [30]. In all proofs we have assumed G d = 0. So we will take G = G f + G o. In later section, we have assumed that G itself is a draw from a BP G = k b kδ ωk in the hierarchical model. Ā BDPθ, β, {γ f,j } Q, {γ o,j} Q, G f, G o ) and Ȳ NMPĀ). We refer the overall process as the Beta-Dirichlet Negative Multinomial Process BDNMP). BDNMP Conjugacy Using Alternative Parametrization for BDP in the Base Measure G). Theorem Assume the conditions for the above theorem and consider N conditionallyindependent draws from NMP Ȳ n = with L m f,n,l δ ul + l=1 Ā BDP M m o,n,i δ vi i.i.d. NMPĀ), for n = 1, 2,, N [ ] ) θ, β, {γ oj } Q, u, ρ, σ, {γ fj } Q, G o 140

141 As before, G G f, G o ) [u, ρ, σ], G o ). {ρ l } L l=1 and {σ l} L l=1 are all > 0. {γ fj} Q are the Dirichlet parameters corresponding to fixed part and are assumed to be same for all fixed atoms. Similarly {γ oj } Q are the Dirichlet parameters corresponding to random part and are assumed to be same for all random atoms. Then [ ] ) Ā Ȳ 1,, Ȳ N BDP θ post, β post, {γ post oj } Q, u post, ρ post, σ post, {γ post fj } Q, G post o with θ post = θ + r 0 N, β post = β θ θ+r 0 N and G post o = G o. There will be total L + M fixed atoms, {u post l } = {u l } L l=1 {v i} M. ρpost, σ post will be such that for l = 1, 2,, L ρ post l = ρ l + σ post l N n=1 q=1 = σ l + r 0 N Q m f,n,l,q and and for i = 1, 2,, M ρ post L+i = σ post L+i N n=1 q=1 = θ + r 0 N Q m o,n,i,q and Also we have, γ post f,l,j = γ fj + γ post o,i,j = γ oj + N n=1 N n=1 m f,n,l,j m o,n,i,j l = 1,, L and j = 1, 2,, Q i = 1,, M and j = 1, 2,, Q Proof. This will be immediate from the following Theorem Formal Proof of Conjugacy of NMP for BDP In this setup, we are generating data from a Q-dimensional Negative Multinomial process NMP). The prior is a Q-dimensional Beta-Dirichlet process BDP). Now we know from BDP construction that it can be written as k=1 p kδ ωk such that for all k, Q p kj 1. We will denote p k = Q p kj and p k0 = 1 Q p kj. analogous to NBP process with BP prior). 141

142 This proof can be constructed as an extension of the proofs given in [20] and [3], which were obtained mainly after Kim s famous proof [19]. Our proof will be similar to those. We are giving it for the readers to be able to see the complete story. Here we are talking about multivariate non-decreasing process with independent increments, which is a multidimensional Lévy process. Let Ā be a Q-dimensional Lévy process on [0, ]. Define a random measure µ as µ[0, t], D) = s t I [ Ās) D 0] where D B[0, 1] Q. According to [70], µ is a Poisson measure. Let ν ) = E[µ )] which is the compensator of the Lévy process Ā. Define a set U by setting U = {u ν{u} [0, 1] Q ) > 0}. Then U is the finite set of times of fixed discontinuity of Ā. Let us denote this set by U = {u 1, u 2,, u L }. Let the distribution function for jumps in the fixed points of discontinuity u j is denoted by G j ). Now ν can be decomposed as ν[0, t] D) = dl t z) + dg j z) D t j t D t = df s z)ds + dg j z) 0 D u j t D t = f s z)d z ds + dg j z) 0 D u j t D { t } where dl t z) = f s z)ds d z and df s z) = f s z)d z Note that for compound Poisson process f s z) = g s z) λs) where λs) = f s z) d z and g s z) = f s z) [0, ] Q λs) 0 when λs) < With finite number of fixed point of discontinuity a compound Poisson process is sometimes called extended compound Poisson process. 142

143 For a general case when λs) =, we have to find the limit of a sequence of compound Poisson processes with finite λs). Theorem Let Ā prior = k=1 p kδ ωk be a discrete, CRM on [0, 1] Q with atom locations in [0, 1]. Let p. = Q p j. Suppose it has the following components. There is no deterministic component. The ordinary component is generated from a Poisson point process with intensity ν c dω, d p) = ν c d p)dω such that ν c is absolutely continuous and ν c [0, 1] Q ) <. In particular, the Q-dimensional weights are the p axes and the atom locations are in the ω axis. There are L fixed atoms at locations u 1, u 2,, u L [0, 1]. The Q-dimensional weight of the l-th fixed atom is a random variable with distribution dg l. Draw an NMP Ȳ with input measure Ā prior and parameter r 0. Let us assume there is only one non-zero atom of Ȳ and let { m 1, s 1 } be the pair of observed data with corresponding location. Note that m 1 = {m 11,, m 1Q } is Q-dimensional vector. The posterior process for the BDP given Ȳ is given by a CRM Ā post with the following components. There is no deterministic component. The ordinary component is generated from a Poisson point process with intensity There are three types of fixed atoms. 1 p.) r 0 ν c dω, d p) = 1 p.) r 0 ν c d p) dω 1. There is old, repeated fixed atom. If u l = s 1, then there is a fixed atom at u l with weight density Q ) 1 1 p. ) r 0 dg l p) where W rf W rf p m 1j j is the normalizing constant Q W rf = p [0,1] Q p m 1j j ) 1 p. ) r 0 dg l p) 143

144 2. There are old, unrepeated fixed atoms. If u l s 1, then there is a fixed atom at u l with weight density 1 W uf 1 p. ) r 0 dg l p) where W uf is the normalizing constant W uf = 1 p. ) r 0 dg l p) p [0,1] Q 3. There is a new fixed atom. If s 1 / {u 1,, u L }, then there is a fixed atom at s 1 with weight density Q ) 1 1 p. ) r 0 ν c d p) W new p m 1j j where W new is the normalizing constant Q W new = p [0,1] Q p m 1j j ) 1 p. ) r 0 ν c d p) Proof. We will assume the existence of a NMP Y with cumulative intensity function Ā prior using a marked Poisson process V. Let V be a marked Poisson process s 1, m 1 ), s 2, m 2 ), where the process I s i t) is the Poisson process with the cumulative intensity function Ā prior. = Q Aprior j. Let Ȳ = m 1, m 2, ) = m 11,, m 1Q ), m 21,, m 2K ), ). Without loss of generality let consider only [0, 1] interval in Ω. Let s K = {s 1, s 2,, s K } be the set of times for jump and let m K = { m 1,, m K } be the corresponding marks. Let B, Σ B ) set of completely random measures on [0, 1] with weights in [0, 1] Q and its associated σ-algebra. Let M, Σ M ) be the set of completely random measures on [0, 1] with atom weights in Z + ) Q and its associated σ-algebra. For any set B Σ B and M Σ M, QB; C) be a probability measure on B by the proposed posterior distribution. Finally, let us denote the marginal distribution of C by P C. It suffices to prove that PB M) = C M QB; C)dP C 5 6) Let us first consider the case when ν c [0, 1] Q ) = [0,1] Q νc d p) = λ < homogeneous Poisson). Define probability density function as νc d p) λ where ν c d p) ν c dp 1 dp 2 dp Q ). 144

145 We can write Ā prior as Ā prior dω) = L η l δ ul dω) + l=1 K ξ i δ vi dv) Note that, ξ i = {ξ i1,, ξ iq } and η l = {η l1,, η lq }. Here K is the number of atoms in the ordinary component of Ā prior. So, total atoms in Ā prior is L + K and total number of atoms in the counting measure with Ā prior should be at most L + K. Atom locations are {v i } K for ordinary component and {u l } L l=1 for fixed point of discontinuities which we assumed to be finite). {s 1 } {u l } L l=1 {v i} K. Q-dimensional atom s weight at fixed point {u l} L l=1 are { η l } L l=1 and at ordinary component location {v i} K are { ξ i } K. Let v K = {v 1, v 2,, v K } and ξ K ; η L ) = { ξ 1,, ξ K ; η 1,, η L }. Let λ = ν c [0, 1] Q ), which is finite by assumption. Then number of atoms in the ordinary component is Poisson distributed. K Poissonλ) {ξ} K are i.i.d distributed random variables with values on [0, 1]Q and each has density νc d p) λ. According to [70], it suffices to consider only the sets B and G which are of the following form B = {K = k, v k ṽ k, ξ k ξ k ), η L η L } M = {T = 1, S 1 s, m 1 = m} where T is the number of points generated by NMP and that is 1 in our case. For any given vector ṽ k [0, 1] k, ξ k [0, 1] k Q and η L [0, 1] L Q, s [0, 1] and m Z + ) Q where v n ṽ n is defined as v i ṽ i i = 1,, n). Similarly ξ n ξ n ) and η L η L are defined component-wise for a vector. Here we have considered the case when Ȳ, the NMP has only one observation, extension to the case for greater than 1 is straight-forward as pointed out in [19]. So in the random measure A prior, we consider a set with a fixed number n of ordinary component atoms and have fixed upper bounds ṽ i, ξ i and η l on the location of the ordinary component atoms, the weights of the ordinary component atoms and weights of the fixed 145

146 atoms, respectively. For the counting measure Ȳ, we restrict to a single atom at s and the mark of that atom is m, which can belong to the Z + ) Q Prior Part First let us calculate PB M). We can write, Now we know, } P {K = k, v k ṽ k, ξ k ξ k ), η L η L { } = P {K = k} ξ k ξ k ), η L η L )) v k, K = k P { dv k K = k } 5 7) v k ṽ k P P{K = k} = λk k! exp λ) 5 8) Also the location of those atoms v i given their total number k is distributed as, note that the element in the set {v i } k are ordered in increasing order of time ). P{v n ṽ n N = n} = n! λ n) ṽ 1 = k! 0 ṽ1 ṽ2 ṽ2 ṽ 1 ṽn 0 ṽ k 1 ṽ 1 ṽk ṽ n 1 n λ dv i k dv i 5 9) Given the atom location and their total numbers, the weight distribution of the whole atom set random and fixed) will be of the following = { ) } P ξ k ξ k ) η L η L ) K = k; v k = ṽ k [ k ] [ ξ i L ν c ] d ξ i ) η l dg l η l ) λ 0 l= ) 5 11) Note than {u l } L l=1 are unique and {v i} k are almost surely unique. We also have T = 1 which can come either from fixed atom part or Poisson process part. Suppose, K = k with v k = ṽ k, ξ k = ξ k and η L = η L. The direct calculation yields after breaking into two separate cases) P {T = 1, S 1 s, m 1 = m Ā prior } 146

147 = k ΞK, L, v k, ξ k, η L, j, v i ) I v i s) + L ΞN, L, v k, ξ k, η L, m, u l ) I u l s) l=1 The probability that the non-zero vector occurs at a particular atom is the probability that the non-zero vector appears at this atom and zero counts appear at all other atoms. In this context, let us now define the function Ξ ). { k ΞK, L, v k, ξ k, η L, m, s) = H m; r0, ξ i ) ) I v i =s) H 0; r 0, ξ i ) ) } I v i <s) { L H m; r 0, η l )) I u l =s) H 0; r 0, η l ) ) } I u l <s) l=1 where ξ i. = Q q=1 ξ iq and η i. = Q q=1 η iq and H is Negative Multinomial NM) distribution function. Combining equations 5 7, 5 8, 5 9, 5 11 and 5 12 from above, we have P{B M} PM C)dPC) C B { k = exp{ λ} ΞK, L, v k, ξ k, η L, m, v i ) I v i s) Ω v Ω ξ Ω η k ) L ) } ν c d ξ i ) dg l η l ) l=1 k dv i) { L + exp{ λ} ΞN, L, v k, ξ k, η L, m, u l ) I u l s) l=1 Ω v Ω ξ Ω η k ) L ) } ν c d ξ i ) dg l η l ) dv i) k l=1 5 12) where Ω v = {v n [0, 1] k : v k ṽ k and v 1 v k }, Ω ξ = {w [0, 1] k Q : w ξ k } and Ω η = {w [0, 1] L Q : w η L } Proof will be complete if we show that QB; C)dP C M C is same as equation For that we will calculate the product and then take the integration to show that they are indeed equal. 147

148 Induced Measure Conditioned on T = 1, S 1 = s and m 1 = m, we have QB; C) = P{K = k, v k ṽ k, ξ k ξ k ), η L η L ) T = 1, S1 = s, m 1 = m} We have precisely two cases to consider - either the atom of NMP is at the same location as a fixed atom of the prior random measure, say u l or it is at a different location. Case I:. Let us consider the first case where s = u l. As before, the number of atoms in the ordinary component is Poisson distributed with mean equal to the total Poisson point process mass So we have, λ post = 1 p= 0 1 p )ν c d p) P{K = k T = 1, S 1 = u l, m 1 = m} = exp{ λ post } λ post) k k! 5 13) Let us look at the distribution of the ordinary component atoms Also, P {v k ṽ k K = k, T = 1, S 1 = u l, m 1 = m} ṽ1 ṽ2 ṽk k = k! λ post ) k λ post dv i = k! 0 ṽ 1 ṽ k 1 ṽ1 ṽ2 0 ṽ k 1 ṽ 1 ṽk k dv i 5 14) P = { ξ k ξ k, η L η L ) v k = ṽ k, K = k, T = 1, S 1 = u l, m 1 = m} [ k ] ξ i 1 ξ i.) ν c d ξ i ) 0 λ post [ L η l 0 H m; r 0, η l )) I l=l ) H 0; r 0, η l ) ) ] I l l ) dgl η l ) 1 l=1 0 H m; r 0, η l )) I l=l ) H 0; r 0, η l ) ) I l l ) dgl η l ) 5 15) Putting together the equations 5 13, 5 14 and 5 15 we have, P = {K = k, v k ṽ k, ξ k ξ k ), η L η L ) T = 1, S 1 = u l, m 1 = m} 1 exp{ λ post } ΞK, L, v k, ξ k, η L, m, u l ) W Ul ) Ω v Ω ξ Ω η 148

149 k k ) L ) dv i) ν c d ξ i ) dg l η l ) 5 16) l=1 where W Ul ) = L l=1 1 0 H m; r 0, η l )) I l=l ) H 0; r 0, η l ) ) I l l ) dgl η l ) Case II:. For the case where s / {u 1, u 2,, u L }, conditioned on S 1 = s and K = k, there exists a i {1, 2,, k} such that v i = s. Let us also assume that v i is the o-th smallest order statistics as {v i } k are a.s. unique). So for this case, we can write = P{K = k, v k ṽ k, ξ k ξ k ), η L η L ) T = 1, S 1 = s, m 1 = m} k { P{K = k, v o = s T = 1, S 1 = s, m 1 = m} } P{v k ṽ k, ξ k ξ k ), η L η L ) K = k, T = 1, S 1 = v o = v i, m 1 = m} 5 17) Now note that number of atoms on either side of v i is Poisson distributed. So we have, P {K = k, v o = s T = 1, S 1 = s, m 1 = m} = exp{ λ post v i } λ postv i ) o 1 o 1)! exp{ λ post 1 v i )} λ post1 v i )) k o k o)! 5 18) Let us look at the distribution of the atoms on either side of v i normalized appropriately) P {v k ṽ k K = k, T = 1, v o = S 1 = s, m 1 = m} ṽ1 ṽ2 ṽo o 1 ) = o 1)! λ post ) o 1) dv i λ post 0 ṽ 1 ṽ o 1 v i ṽo+1 ṽo+2 ṽk k ) k o)! λ post ) k o) dv i λ post ṽ o ṽ o+1 ṽ k 1 1 v i i=o+1 ṽ1 ṽ2 ṽo o 1 ) dv i = o 1)! 0 ṽ 1 ṽ o 1 v i ṽo+1 ṽo+2 ṽk k ) dv i k o)! ṽ o+1 1 v i ṽ o ṽ k 1 i=o ) 149

150 Also the weight distributions will be = P{ ξ k ξ k ), η L η L )) v k ṽ k, K = k, T = 1, v o = S 1 = vi, m 1 = m} k ξ i 0 H m; r0, ξ i ) ) I i=i ) H 0; r 0, ξ i ) ) I i i ) ν c d ξ i ) H m; r0, ξ i ) ) I i=i ) H 0; r 0, ξ i ) ) I i i ) νc d ξ i ) 1 0 η l [ L l=1 0 H 0; r 0, η l ) dg l η l ) 1 0 H 0; r 0, η l ) dg l η l ) ] 5 20) Combining equations 5 22, 5 18, 5 19 and 5 20 we have, P{K = k, v k ṽ k, ξ k ξ k ), η L η L ) T = 1, S 1 = v i, m 1 = m} = 1 exp{ λ post } ΞK, L, v k, ξ k, η L, m, v W V Ω o) i ) v Ω ξ Ω η k ) L ) ν c d ξ i ) dg l η l ),i o dv i) k l=1 5 21) where [ 1 W V = H m; r0, ξ i ) ) [ L I i=i ) ν c d ξ i )] 0 l=1 1 0 H 0; r 0, η l ) dg l η l ) because the other part is getting canceled from part of the equation Also we have, ] Ω o) v = {v k [0, 1] k : v k ṽ k and v 1 v o 1 v o = s v o+1 v k }. Putting together equation 5 16 and 5 21 we get, = + I P{K = k, v k ṽ k, ξ k ξ k ), η L η L ) T = 1, S 1 = s, m 1 = m} { L 1 I u l = s) exp{ λ post } W Ul) l=1 [ k ) L ) ]} ΞK, L, v k, ξ k, η L, m, u l ) ν c d ξ i ) dg l η l ) Ω v Ω ξ Ω η ) k s / {u l } L l=1 Ω o) v Ω ξ Ω η [ { 1 W V exp{ λ post } ΞK, L, v k, ξ k, η L, m, v i ) dv i) k l=1 150

151 k k ) L ) ]} dv i) ν c d ξ i ) dg l η l ),i o l=1 5 22) Marginal Distribution of Ȳ Via Marked Poisson Process Let us consider the definition of a Marked Poisson ProcessMPP) to find out the marginal. We are starting from a BD Process BDP) prior. So, let ω, p) forms a Poisson Process PP) on Ω [0, 1] Q with mean measure ν. Now mark each Q + 1)-dimensional Q from p and 1 from ω) point ω, p) with a random variable Z whose value lies in a space V with a transition ) probability P p, ). Then ω, p, Z) form a Poisson process on Ω [0, 1] Q V with mean measure ν P p, Z). In NMP process context, Z take values in B = {0, 1, 2, } {0, 1, 2, } = ) {0, 1, 2, } Q. Now, here {ω, p, Z} is a PP on Ω [0, 1] Q B with mean measure νdω, d p) P p, Z). We also know there is a counting measure associated with every PP. Let this counting measure be Ndω, d p, Z). Note that in our case, the transition probability P p, ) is the Negative Multinomial NM) probability distribution on space B. Let us denote C = B { 0} to exclude the zero value from our coupled process. So P p, C) is the probability of the set C w.r.t Negative Multinomial distribution with parameters r 0, p), so P p, C) = 1 1 p. ) r 0 ) [as we are considering only one point), where p = Q p j. We want the counting measure NA, [0, 1] Q, C) is a random variable with probability where A Ω) νa, [0, 1] Q ) P p, C) = νa, d p) P p, C) [0,1] Q = νdω, d p) P p, C) A [0,1] Q = µa) 1 1 p ) r ) 0 νd p) = λ A let) [0,1] Q We want to first compute the probability distribution for PK = 1, m i C, s 1 ŝ 1 ). For that we are trying to compute the density first and then integrating that out to get this distribution. Then density will look like as below, remember U ɛ is a set which is v 1 ɛ 2, v 1 + ɛ 2 ) 151

152 where v 1 is the point in Ω axis generated by the random PP. [ ] [ ] P NΩ U ɛ ), [0, 1] Q, C) = 0 P NU ɛ, [0, 1] Q, C) = 1 lim ɛ 0 ɛ Now, from Poisson distribution we know [ ] P NΩ U ɛ ), [0, 1] Q, C) = 0 = exp{λ Ω Uɛ )} and [ ] P NU ɛ, [0, 1] Q, C) = 1 = λ Uɛ exp{λ Uɛ } Also, we are generating the value m 1 = m at the point v 1, which is one of the points from the set C = B { 0} = {1, 2, 3, } Q. So the probability is given by H m r 0, p) P p, C) conditional probability rule). As we finally want to compute the probability distribution for PK = 1, m 1 = m, s 1 ŝ 1 ), so we have to calculate the following density [ ] [ P NΩ U ɛ ), [0, 1] Q, C) = 0 P NU ɛ, [0, 1] Q, C) = 1 lim ɛ 0 ɛ [ exp { λω Uɛ )} ] [ ] [λ Uɛ exp { λ Uɛ }] H m r0, p) P p, C) = lim ɛ 0 1 = lim ɛ 0 ɛ ] H m r0, p) P p, C) { ɛ [ { }] exp µω U ɛ ) νd p)p p, C) [0,1] Q [ { }] [ ] } H m r0, p) µu ɛ ) νdb)p p, C) exp ɛ νd p)p p, C) [0,1] Q [0,1] Q P p, C) [ { }] [ ] [ ] H m r0, p) = exp νd p)p p, C) νd p)p p, C) [0,1] Q [0,1] Q P p, C) [ { }] [ ] = exp νd p)p p, C) H m r 0, p) νd p) [0,1] Q [0,1] [ { Q = exp 1 1 p ) r ) }] [ ] 0 νd p) H m r 0, p) νd p) [0,1] Q [0,1] [ { }] Q [ ] = exp νd p) + 1 p ) r 0 νd p) H m r 0, p) νd p) [0,1] Q [0,1] Q [0,1] [ ] Q = e λ+λ post) H m r 0, p) νd p) [0,1] Q by 152

153 Here we have used that the measure µω U ɛ ) = 1 ɛ and µu ɛ ) = ɛ and ɛ 0. So we have, PK = 1, m 1 = m, s 1 ŝ 1 ) = ŝ1 ω=0 [ ] e λ+λ post) H ˆ m 1 r 0, p) νd p) dω [0,1] Q Now let us substitute the values for p ξ i and ν c ν. Note that this atom does not come from the fixed point set. So in order to get the correct marginal we have to multiply this term with L l=1 H 0; r 0, η l ). So finally we have = P{T = 1, S 1 s, m 1 = m} [ L 1 [ H 0; r 0, η l ) ) s dg l η l )] e λ+λ post) l=1 0 = exp{ λ} exp{λ post } s ω=0 Similarly for the other case we have, W V dω ω=0 ) ] H ˆ m 1 r 0, ξ i ) ν c d ξ i ) dω [0,1] Q P{T = 1, S 1 = u l, m 1 = m} = exp{ λ} exp{λ post } W Ul ) Checking the Integration For simplicity, let us use the following notation, { k k ) L )} Z 1 = ΞK, L, v k, ξ k, η L, m, u l ) dv i) ν c d ξ i ) dg l η l ) Ω v Ω ξ Ω η l=1 { k k ) L )} Z 2 = ΞK, L, v k, ξ k, η L, m, v i ) dv i) ν c d ξ i ) dg l η l ) Ω v Ω ξ Ω η l=1 { k k ) L )} Z o) 2 = ΞK, L, v k, ξ k, η L, j, v i ) dv i) ν c d ξ i ) dg l η l ) Ω o) v Ω ξ Ω η,i o l=1 The joint distribution prior) is given by L.H.S = exp{ λ} L I u l s) Z 1 + exp{ λ} l=1 k I v i s) Z ) 153

154 Prior to integration, the induced posterior multiplied fixed with fixed part and random with random part) with corresponding marginal is given by [ L ] 1 [exp{ λ} ] R.H.S = I u l = s) exp{ λ post } Z 1 exp{λpost} W Ul) W Ul) l=1 [ k ] [ + I v i = s) 1 s ] exp{ λ post } Z o) 2 exp{ λ} exp{λ post} W V dω W V = exp{ λ} = exp{ λ} L I u l = s) Z 1 + exp{ λ} l=1 L I u l = s) Z 1 + exp{ λ} l=1 Now taking the integration of equation 5 24 we have, [ s L exp{ λ} I u l = s) Z 1 + exp{ λ} 0 = exp{ λ} l=1 ω=0 k s I v i = s) Z p) 2 dω ω=0 k I v i = s) Z ) L I u l s) Z 1 + exp{ λ} l=1 ] k I v i = s) Z 2 k I v i s) Z ) Now we can see the equation 5 25 is exactly equal to the equation Thus we have proved the equation 5 6 for Beta-Dirichlet-Negative-Multinomial Process BDNMP) The Case When ν = Theorem The previous theorem can still be applied when the Poisson intensity measure is not finite in the interval [0, 1], i.e. ν c [0, 1] Q ) =, but rather satisfies a weaker condition [0,1] Q p. ) ν c d p) < 5 26) The sequence of compound Poisson process Ā n prior can be obtained as the following Ā n prior,jt) = { Ā prior,j s)i A prior,. > 1 } n s t 154

155 The idea of the proof is to show that the limit of the sequence of the compound Poisson process that has finite intensity will converge to the correct process. Here is a theorem whose proof can be constructed from the Theorem 3 found in [3]. Theorem Let Ā prior,n be a CRM with a finite set of fixed atoms in [0, 1] and with the Poisson process intensity νn c. ν c satisfies equation Let Ȳ n be drawn as a Negative Multinomial process with parameter r 0 and Ā prior,n. Let Ā prior be a CRM with Poisson process intensity ν c and let Ȳ be drawn as a Categorical process with parameter r 0 and Ā prior. Then, ) d Āprior,n, Ȳ n Ā prior, Ȳ ) Using this previous Theorem and Theorem 3.2 in [19], we can show that even with this weaker condition, the conjugacy of NMP and BDP still holds Beta-Dirichlet-Negative Multinomial Process as a Marked Poisson Process Let us state another very useful theorem called Marking Theorem in the context of Poisson point process theory. The proof can be found in [17]. Theorem Let Π be a Poisson Process on S with mean measure µ. Suppose with each point X of the random set Π, we associate a random variable m X the mark of X ) taking values in some other space M. The distribution of m X may depend on X but not on the other points of Π, and m X for different X are independent with respective distributions px, ). The pair X, m X ) can then be regarded as a random point X in the product space S M. The totality of points X forms a random countable subset Π = {X, M X ) X Π} of S M. Now this random subset Π is a Poisson process on S M with mean measure µ given by µ A) = µdx)px, dm) x,m) A 155

156 The beta distribution may also be parametrized in terms of its mean µ 0 < µ < 1) and sample size ν = α + β ν > 0). α = µ ν where ν = α + β > 0 β = 1 µ)ν The similar parametrization has also been used by Hjort in his paper. Recall that a BP draw B BPc, A 0 ) can be considered as a draw from the Poisson process with mean measure ν BP where CRM B is defined on product space [0, 1] with the Lévy measure ν BP dp dω) = cp 1 1 p) c 1 dp A 0 dω) where c > 0 is the concentration parameter or concentration function if c depends on ω. A 0 is the base measure with A 0 Ω) = γ, which is called mass parameter. For BD it is defined on product space [0, 1] K Ω with the Lévy measure ν BD d p dω) = Γ K β j) Γβ 1 ) Γβ K ) c pβ p K β K 1 p.) K β j 1 p.) c 1 d p A 0 dω) where p. = K p j 1 and p denotes that it is a vector-valued random process. In a very similar fashion, we can see that a BD draw D BDc, A 0, β 1,, β K ) where {β i } K are Dirichlet parameters which can be function of ω ). We can mark a random point ω k, p k ) of D with a random variable r 0k taking values in R +, also r 0k and r 0k are independent if k k. Now from marked Poisson process theory of Kingman suggests that {ω k, p k, r 0k )} k=1 can be viewed as random points drawn from a Poisson process in the product space Ω [0, 1] K R +, with compensator or Lévy measure ν # BD = Γ K β j) Γβ 1 ) Γβ K ) c pβ p K β K 1 p.) K β j 1 p.) c 1 d p A 0 dω) H 0 dr 0 ) where H 0 is a continuous finite measure on R + with its mass parameter Υ = H 0 R + ). Now with the previous equation, using A 0 H 0 as a base measure, we can construct a marked BD 156

157 process D # BDc, H 0 A 0, β 1,, β K )) as D # = p k δ ωk,r 0k ) k=1 where an atom ω k, r 0k ) comes with weight p k [0, 1] K. Now let us define i-th draw from a negative multinomial process as Y i NMPD # ). Y i = v ki δ ωk where v ki NMr 0k, p k ). k=1 The BD draw D # defines a set of parameters { p k, ω k, r 0k } k=1 and r 0k, p k ) are used to draw negative multinomial count vector denoted by v ki for each Y i draw and the atoms ω k are shared across all draws of Y i s. The count vector associated with a given atom ω k is a function of a index i, denoted by v ki Experiment with Simulated Data and Results We will use this BD Process BDP) coupled with Negative Multinomial process NMP) in some admixture model. From an NM draw, we get a Q-dimensional vector of counts for one mixing component. Now we can actually treat these counts to draw data from all the sub-categories for a particular cluster and typically cluster information is a latent variable and with Bayesian nonparametric techniques we can also infer the unknown number of clusters. We can first choose the number of data points associated with each sub categories for a cluster and then generate the data according to those counts from that cluster. So the following is the generative process for the data points First randomly draw Q number of counts m j1,, m jq for cluster j where j = 1, 2,, K. Now from each of the sub-clusters q of cluster j generate m jq number of data. For each i = 1, 2,, m jq generate data x jq from F θj, q)) where θj, q) is the appropriate parameter or set of parameters. Total number of data generated in that manner is K Q q=1 m jq. We will write down the model for actual experiment here. 157

158 1 α)ω 1 +αt 1 1 α)ω 2 +αt 1 1 α)ω 3 +αt 1 1 α)ω 1 +αt 2 1 α)ω 2 +αt 2 1 α)ω 3 +αt 2 i d,1,1 m d,2,1 m d,3,1 i d,1,2 m d,2,2 m d,3,2 Ȳ d b d,1,1 b d,2,1 b d,3,1 b d,1,2 b d,2,2 b d,3,2 Ā d d = 1,,D b 0,1 b 0,2 b 0,3 ω 1 ω 2 ω 3 B 0 Ω Figure 5-6. The Hierarchical BP-BD-NM process with K = 3 and Q = 2 The above Figure 5-7 gives the basic schematic for our generative model in topic modeling. The Hierarchical Beta-Beta Dirichlet-Negative Multinomial HBBDNM) Process can be generated by Ā d = k Ȳ d = k B 0 = k b 0,k δ ωk BPθ 0, γ 0, H) [b d,k,1 b d,k,q ] T δ ωk BD [m d,k,1 m d,k,q ] T δ ωk NMr 0 d, Ā d ) ) B 0 θ d, γ d, β 1,, β Q ), B 0 Ω) d = 1, 2,, D 158

159 where H is the base measure for Beta Process BP). Here is the generative model for each document which is using bag-of-words framework. For each document d = 1, 2,, D, there is an exchangeable observations {x d,n } N d n=1. Let us assume ω k are the latent topics for the corpus and z d,n is the topic associated with x d,n. Now in this scenario with BD, we have the following generative model, Algorithm 9 Generative model for the corpus for each document d, d = 1,, D do for each topic k, k = 1,, K do for each category j, j = 1,, Q do for l = 1, 2,, m d,k,j do Draw a Bernoulli random variable c with success probability α if c is 1 then Generate a word from the vocabulary from a known basic topic T j else Generate a word from the vocabulary from an unknown topic ω k end if end for end for end for end for Synthetic Data In order to build the synthetic data, we take W = 100 word length vocabulary. We take 20 words from the following 5 topics - Computer, Biology, Computational Biology, Computational Neuroscience and Bioinformatics. Now among these topics we realize Computer and Biology are like super-topic and under these two we have 3 topics which somehow combines them together but in a very different way most likely. We have given the generative model in the above figure 5-7. Here T 1 is the topic Computer and T 2 is the topic Biology which are known. Now based on the above generative model, we generate 300 documents with on average 250 to 400 words in it. Now we run our inference techniques on this synthetic data-set. We ran a Gibbs sampler where all the conditional distribution are mentioned below. We did 1000 burn in and collected 1000 samples after every 5 iterations to remove a possible 159

160 correlation. We can see that we somewhat recovered the underlying latent topics of Computational biology, bioinformatics and Computational Neuroscience as the top 6 topics. r 0 for every document denoted by r 0 d was set by using the following equation r 0 d = N d θ 0 1) θ 0 γ 0 where this estimate comes from the expectation of NM distribution. Also note that the vocabulary words are currently denoted by topic # and these are the representative words for that particular topic. We will state and show two facts here Inference for BDNM Model Negative Multinomial likelihood is conjugate to Beta Dirichlet prior The negative multinomial NM) distribution is a generalization of the negative binomial distribution for more than two outcomes. We have see that BD distribution is conjugate to NM likelihood Negative Multinomial as a mixture of Gamma and multivariate independent PoissonMIP) X = n 1,, n Q ) NMr, p 1,, p Q ) is equivalent to the following mixture λ Gammar 0, 1 p.) X Multivariate-Poisson p 1 λ, p 2 λ,, p Q λ) where p. = Q p i and note that we are using the Gamma distribution for λ with shape parameter α > 0 and rate parameter β > 0 and the distribution function looks like β α Γα) λα 1 e βλ The mixture will look like, [ Q pj λ) n ) ] [ j exp p j λ) 1 p.) r 0 n j! Γr 0 ) 0 ] λ r0 1 exp 1 p.)λ) dλ 160

161 Figure 5-7. Top 6 topics and their top 20 words 161

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Bayesian Nonparametrics: Models Based on the Dirichlet Process Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro

More information

Bayesian Nonparametrics: Dirichlet Process

Bayesian Nonparametrics: Dirichlet Process Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Infinite Feature Models: The Indian Buffet Process Eric Xing Lecture 21, April 2, 214 Acknowledgement: slides first drafted by Sinead Williamson

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Dirichlet Processes: Tutorial and Practical Course

Dirichlet Processes: Tutorial and Practical Course Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS

More information

Bayesian non parametric approaches: an introduction
