Probabilistic Models for Multi-relational Data Analysis


Probabilistic Models for Multi-relational Data Analysis

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA

BY

Hanhuai Shan

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

Advisor: Arindam Banerjee

June, 2012

© Hanhuai Shan 2012. ALL RIGHTS RESERVED.

Acknowledgements

There are many people who have earned my gratitude for their contribution to my time in graduate study. First, I would like to express my sincere gratitude to my advisor Prof. Arindam Banerjee for his invaluable support and guidance throughout my graduate study. I would also like to thank him for introducing me to the area of machine learning and data mining. His enthusiasm for research and his serious attitude toward work have had a great influence on me. I feel lucky to have him as my advisor and I really enjoyed working with him. Second, I would like to thank Prof. Daniel Boley, Snigdhansu Chatterjee and Vipin Kumar for agreeing to be on my thesis committee. I am also grateful to the people I have collaborated with: Amrudin Agovic, Ramesh Natarajan, Nikunj Oza, Guillermo Sapiro, Hongjun Wang, and Tinghui Zhou. It was my honor to work with them. I would also like to show my gratitude to my lab mates: Amrudin Agovic, Soumyadeep Chatterjee, Aritra Chowdhury, Amir Taheri, Puja Das, Qiang Fu, and Huahua Wang. They made my Ph.D. life more colorful. I will miss our discussions, travels to conferences, group dinners, and so on. Finally, my special thanks go to my family: my parents, my husband, and my dear little daughter, for all their love, support and encouragement.

Dedication

To my parents.

Abstract

With the widespread application of data mining technologies to real-life problems, there has been an increasing realization that real data are usually multi-relational, capturing a variety of relations among objects in the same or different entities. For example, in movie recommender systems, the movie rating matrix captures the relation between movies and users, the social network captures the relation among users, and the cast of the movies captures the relation between movies and actors/actresses. Multi-relational data analysis on such data includes two important tasks: (1) discovering multi-relational clusters across multiple entities, i.e., multi-relational clustering, and (2) predicting missing entries, i.e., multi-relational missing value prediction. Clustering and missing value prediction give us a better understanding of the data and help us with decision making. For example, clusters of users and movies, together with whether each user cluster likes each movie cluster, provide a high-level overview of movie rating data. In addition, the prediction of the missing ratings helps us decide whether to recommend the movies to the corresponding users. Moreover, it is particularly meaningful to perform clustering and missing value prediction under the multi-relational setting, since doing so combines multiple sources of information effectively and usually outperforms algorithms run on a single source of data alone. We develop probabilistic models for multi-relational data analysis because of their advantage in incorporating prior knowledge from multiple sources through prior distributions, and their modularity in combining multiple models through shared latent variables. By performing experiments on a variety of data sets, such as movie recommendation data and ecological data on plant traits, we show that multi-relational clustering and missing value prediction have superior performance compared to algorithms that use a single data source only.

Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures

1 Introduction
  Motivation
  Overview
    Mixed-membership Models
    Discriminative Mixed-membership Models
    Bayesian Cluster Ensembles
    Bayesian Co-clustering
    Parametric Probabilistic Matrix Factorization
    Probabilistic Matrix Factorization with Side Information
    Probabilistic Tensor Factorization
  Main Contributions
2 Mixed-membership Models
  Preliminaries
    Finite Mixture Models
    Naive Bayes Models
    Latent Dirichlet Allocation
  Mixed-membership Naive Bayes Models
  Inference and Estimation
    Variational Approximation
    Parameter Estimation
    Variational EM for MMNB
  Fast Variational Inference
    Variational Approximation
    Parameter Estimation
    Fast LDA
  Experimental Results
    Datasets
    Results for MMNB vs. NB
    Results for Fast MM vs. MM
    Results for Cluster Assignments of MMNB
  Conclusion
3 Discriminative Mixed-membership Models
  Discriminative LDA
  Discriminative MMNB
  Inference and Parameter Estimation
  Experimental Results
    DMM vs. MM
    Fast DMM vs. Other Classification Algorithms
    Topics from Fast DLDA
  Conclusion
4 Bayesian Cluster Ensembles
  Problem Definition
    Missing Value Cluster Ensembles
    Row-distributed Cluster Ensembles
    Column-distributed Cluster Ensembles
  Bayesian Cluster Ensembles
  Variational Inference for BCE
    Row-distributed EM Algorithm
    Column-distributed EM Algorithm
  Experimental Results
    General Cluster Ensembles
    Cluster Ensembles with Missing Values
    Cluster Ensembles with Increasing Columns
    Row-distributed Cluster Ensembles
    Column-distributed Cluster Ensembles
  Conclusion
5 Bayesian Co-clustering
  Bayesian Co-clustering
  Inference and Learning
    Inference
    Parameter Estimation
  Experimental Results
    Experiments on Simulated Data
    Experiments on Real Data
  Conclusion
6 Residual Bayesian Co-clustering
  Residual Bayesian Co-clustering
  Inference and Learning
    Variational Inference
    Prediction
    Parallel RBC
  Fully Factorized Variational Distribution
    Prediction
    Parallel RBC
  Experimental Results
    Missing Value Prediction for Existent Rows and Columns
    Missing Value Prediction for New Rows and Columns
    Running Time of Parallel RBC
  Conclusion
7 Parametric Probabilistic Matrix Factorization
  Preliminaries
  Parametric PMF
  Residual PPMF
  Experimental Results
    PPMF vs. Co-clustering Algorithms
    PPMF vs. PMF and BPMF
    PPMF vs. Residual PPMF
  Conclusion
8 Probabilistic Matrix Factorization with Features
  PMFF
  Experimental Results
  Conclusion
9 Kernelized Probabilistic Matrix Factorization
  Kernelized Probabilistic Matrix Factorization
    KPMF Versus PMF/BPMF
    Gradient Descent for KPMF
    Stochastic Gradient Descent for KPMF
    Prediction
  Experiments on Recommender Systems
    Datasets
    Graph Kernels
    Methodology
    Results
  Experiments on Image Restoration
  Conclusion
10 Hierarchical Probabilistic Matrix Factorization
  HPMF for Trait Prediction
  Experimental Result
    Dataset
    Accuracy in Trait Prediction
    Trait Correlation
    True Trait vs Predicted Trait
  Conclusion
11 Probabilistic Tensor Factorization
  Parametric Probabilistic Tensor Factorization
    PPTF
    Variational Inference for PPTF
    Prediction
  Bayesian Probabilistic Tensor Factorization
    BPTF
    Inference and Prediction
  Experimental Results
    Dataset
    Result on Point Estimate
    Result on Multiple Imputation
  Conclusion
12 Conclusion
References
Appendix A. Variational Inference for mixed-membership models
  A.1 MMNB
    A.1.1 Variational Inference
    A.1.2 Parameter Estimation
  A.2 Fast MMNB
    A.2.1 Variational Inference
    A.2.2 Parameter Estimation
  A.3 Fast LDA
    A.3.1 Variational Inference
    A.3.2 Parameter Estimation
  A.4 DLDA and DMMNB

List of Tables

Overview of the thesis
The number of data points, features and classes in each UCI dataset
Perplexity of MMNB and NB on UCI data
Running time (seconds) of Fast MM and MM
Word list for three topics on Nasa
Word list for three topics on Classic
Word list for three topics on Diff
Word list for three topics on Sim
Word list for three topics on Same
Updates for variational parameters in DMM and Fast DMM
Accuracy for LDA and DLDA (k=t) on text data
Accuracy for MMNB and DMMNB (k=t) on UCI data
Running time (seconds) of DLDA and Fast DLDA on text data
Running time (seconds) of DMMNB and Fast DMMNB on UCI data
Accuracy on text data for Fast DLDA and other classification algorithms
Accuracy on UCI data for Fast DMMNB and other classification algorithms
Accuracy on text data from Fast LDA+LR and Fast DLDA
Accuracy on UCI data from Fast MMNB+LR and Fast DMMNB
Extracted topics from the Nasa dataset using Fast DLDA
The number of data points, features, and classes in each data set
The applicability of algorithms to different experimental settings
Cluster ensemble results using k-means as the base clustering algorithm
Cluster ensemble results using different base clustering algorithms
Expressions for terms in the lower bound function
Micro-precision on simulated data
Perplexity of BCC, MMNB, and LDA
Expressions for terms in L using q
MSE of RBC on Movielens compared to other co-clustering algorithms
MSE of RBC on Foodmart compared to other co-clustering algorithms
MSE of RBC on Movielens compared to SVD, NNMF and CORR
MSE of RBC on Foodmart compared to SVD, NNMF, and CORR
MSE from PPMF and co-clustering based algorithms
MSE from PMF, BPMF and PPMF
MSE from PPMF and residual PPMF
MSE for PMFF compared to PPMF
Number of users, items and ratings of the data sets used
RMSE and coverage from SNB
RMSE on users with no ratings for training
Comparison of RMSE and running time for KPMF GD and KPMF SGD
Trait ID, names, number of non-missing entries and definition
RMSE of HPMF and other methods
Acronyms for the compared algorithms
RMSE on retail sales data with 90% training data
RMSE on retail sales data with 10% training data

List of Figures

The structure of the chapters in the thesis
An overview of mixed-membership models
Graphical model representation of NB and LDA
Graphical model representation of MMNB
Variational distributions for MM and Fast MM models
Perplexities of NB and MMNB with k = 20 and varying ϵ on Movielens
Perplexity surfaces of NB and MMNB over a range of k and ϵ on Movielens
Perplexity of Fast MM compared to MM
Histogram of cluster membership entropy on Glass for MMNB and Fast MMNB
Posterior over 6 components at different stages of an E-step
Perplexities with ascending cluster membership entropy on UCI data
An overview of discriminative mixed-membership models
Graphical models for DLDA and DMMNB
Two ways of processing base clustering results for cluster ensembles
Average MP with increasing percentage of missing values
Average MP comparison with increasing number of base clusterings
Average MP with increasing number of distributed partitions
Running time for column-distributed ...
Bayesian co-clustering model
Variational distribution q
Parameter estimation for Gaussian
Perplexity comparison of BCC, MMNB and LDA on binary Jester
Perplexity comparison of BCC and MMNB on Movielens
Perplexity curves with increasing percentage of noise
Perplexity curves of BCC and LDA with increasing noise on binarized Jester
Perplexity curves of BCC and LDA with increasing noise on binarized Movielens
Co-cluster parameters for Movielens
Co-embedding and signatures for users and movies on Movielens data
The graphical model for RBC
Variational distribution q
MSE on Jester compared to different algorithms with different choices of (k1, k2)
MSE of RBC and BCC on Jester for new users and jokes
Running time of parallel RBC and SVD
The graphical model for PMF
The graphical model for PPMF
Graphical model for residual PPMF
Graphical model for PMFF
Graphical model for KPMF
PMF/BPMF and KPMF
Examples for input data
Statistics of Flixster and Epinion data used
RMSE for different algorithms on Flixster and Epinion
Performance improvement from PMF with increasing amount of training data
The graph constructed for the rows of the image
Image restoration results using KPMF
HPMF model for TRY data
RMSE of HPMF and LPMF with increasing number of iterations
RMSE_MEAN - RMSE_HPMF on two parts of test data
Scatter plot for pairs of traits
Scatter plot for (true value, predicted value) on test data
CP decomposition of a tensor R
Price and unit sales changes of one product in one store over a period of time
Graphical model for PPTF
Graphical model for BPTF
The histogram of the entry values in price and unit-sales tensors
RMSE of PPTF on synthetic data with missing entries and missing fibers
RMSE of BPTF on synthetic data with missing entries and missing fibers
Running time on synthetic data
Running time on retail sales data with 90% training data
Running time on retail sales data with 10% training data
RMSE on entries with ascending standard deviation of samples

Chapter 1 Introduction

With the widespread application of data mining technologies to real-life problems, there has been an increasing realization that real data are usually multi-relational, capturing a variety of relations among objects in the same or different entities, where an entity is defined as a group of objects of the same type. For example, in movie recommender systems, there are entities such as movies, users, and actors/actresses connected with each other. The movie rating matrix connects movies and users, the social network connects users, and the cast of the movies connects movies and actors/actresses. A multi-relational data set is hence composed of multiple aspects, e.g., the movie recommendation data includes movie ratings, the users' social network, the movies' cast, etc. Multi-relational data is also widespread in many other domains: in retail sales data, we have the products' price information in different stores, the customers' purchasing records on products, and hierarchical information for stores based on their locations (city, province, country, etc.). Moreover, in online advertisement, we have data on advertisements shown on certain webpages for certain companies, and the webpages are connected through hyperlinks. Such multi-relational data contains rich information; to make good use of it, it is crucial to perform multi-relational data analysis that combines multiple sources of data effectively.

1.1 Motivation

In this thesis, we discuss multi-relational data analysis for two important tasks: (1) discovering multi-relational clusters across multiple entities, viz., multi-relational clustering, and (2) predicting missing entries, viz., multi-relational missing value prediction.

For task (1), traditional clustering algorithms [40, 36, 43, 58] consider the data set as a collection of feature vectors and perform clustering on the feature vectors. Such a simple strategy has several limitations. First, it breaks the correlations among features. Given a movie rating matrix, traditional clustering algorithms would treat the ratings on different movies from each user as a feature vector for the user and perform clustering on the feature vectors, i.e., the movies become the features. In that case, if user A likes Toy Story 2 and user B likes Toy Story 3, they will probably be assigned to two different clusters, since they do not share a similar value under the same feature. Such results are obviously not satisfactory. In comparison, multi-relational clustering performs clustering on users and movies simultaneously. It puts Toy Story 2 and 3 in the same movie cluster, and then users A and B will also be in the same user cluster, since they have similar preferences on the same cluster of movies, though not on the same individual movie. In addition, multi-relational clustering not only generates clusters for objects in each entity, but also gives the relations among the clusters in different entities. For instance, clustering on movie rating data not only gives user and movie clusters, but also tells whether each user cluster likes each movie cluster. Such results provide us with a high-level overview of movie rating data and, further, a better understanding of the data. More importantly, the relations among the clusters usually help decision making. If we know that a cluster of users likes a cluster of movies, e.g., movies about romantic stories during wars, we can recommend new movies in this category to users in the corresponding cluster. The second limitation of traditional clustering is that it is unable to make use of multiple sources of information simultaneously. For example, if we have a movie rating matrix and a social network among the users, there is no elegant way in traditional clustering algorithms to combine both sources of information for user clustering. Multi-relational clustering, in this case, is able to merge multiple sources of information effectively.

For task (2), most multi-relational data has a large number of missing entries.

For instance, according to [110], the density of non-missing ratings in most commercial recommender systems is less than 1%. Therefore, an accurate prediction is important to directly help decision making: we can recommend a movie to a user if we know he or she would like it. While matrix factorization based strategies [16, 107, 100, 76, 109] have become the state of the art in collaborative filtering for missing value prediction, one limitation of traditional matrix factorization algorithms is that they can only work on the target matrix and are unable to incorporate side information into the algorithm. Side information widely exists in real problems. As mentioned above, it could be the social network among users in movie recommender systems, where we need to predict the missing ratings, or the hierarchical location information for the stores in retail sales data, where we need to predict the missing purchasing records for the users. In these settings, how to incorporate such side information effectively for more accurate missing value prediction becomes an interesting and meaningful task.

In this thesis, we work on several types of multi-relational data, as follows (a toy sketch of these representations is given after the list):

1. The simplest type is the classic form of data: a collection of feature vectors for objects in one entity, such as most of the data sets in the UCI machine learning repository. We refer to such data as one-way data.

2. Two-way data captures the relationship between two entities. It is usually represented as a data matrix, with the rows and the columns each denoting one entity and the entries denoting their relationship. For example, the movie rating matrix has users as rows and movies as columns, and each entry is the preference relationship between the corresponding user and movie.

3. Similar to two-way data, three-way data captures the relationship between three entities and can be represented as a three-way tensor. For example, in online advertising, we could have a click-through rate tensor with the three dimensions being advertisements, webpages and time respectively.

4. Other than the three types above, more complicated types of data, such as 4-way tensors or combinations of two-way and one-way data, are referred to as multi-way data. Concrete examples we use in this thesis are a movie rating matrix with movie information and/or the users' social network as side information.
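As a toy illustration (reader-added, not from the thesis), the three basic data types above can be represented as numpy arrays; the shapes are only examples, with the Movielens dimensions mentioned later used for the two-way case.

```python
# Toy representations of one-way, two-way, and three-way data (illustrative only).
import numpy as np

one_way = np.random.randn(100, 8)          # 100 objects x 8 features
two_way = np.full((943, 1682), np.nan)     # users x movies rating matrix, nan = missing
two_way[0, 10] = 4.0                       # user 0 rated movie 10 with a 4
three_way = np.zeros((50, 200, 30))        # ads x webpages x time click-through tensor
```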

Table 1.1: Overview of the thesis, organized by task (clustering, missing value prediction) and data type (one-way, two-way, three-way, multi-way data).

A main goal for clustering and missing value prediction in the multi-relational data setting is to combine multiple sources of information. Taking data with a movie rating matrix and user/movie side information as an example, to combine both sources of information to predict the ratings, one can either perform a joint matrix factorization [44, 82] on both the rating matrix and the side information, or use the side information for regularization when factorizing the rating matrix [57, 83]. In this thesis, we use probabilistic models for the following reasons. First, it is easy to incorporate prior knowledge from multiple sources through prior distributions in probabilistic models. Second, probabilistic models are modular: with one probabilistic model working on one source of data, it is easy to combine multiple probabilistic models for multiple sources of data by sharing latent variables. Third, probabilistic models can capture the uncertainty of the data and the results. Finally, probabilistic models are usually interpretable, which gives us a better understanding of the results.

1.2 Overview

In this section, we give an overview of the thesis. As shown in Table 1.1, we discuss clustering on one-way and two-way data, and missing value prediction on two-way, three-way and multi-way data. The main structure of the chapters is shown in Figure 1.1. In particular, Chapters 2-6 discuss clustering, with Chapters 2-4 covering clustering on one-way data and Chapters 5-6 covering clustering on two-way data. Chapters 7-11 discuss missing value prediction, with Chapter 7 covering missing value prediction on two-way data, Chapter 11 on three-way data, and Chapters 8-10 on multi-way data.

Figure 1.1: The structure of the chapters in the thesis. Clustering: one-way data (Mixed-membership Models, Chapter 2; Discriminative Mixed-membership Models, Chapter 3; Bayesian Cluster Ensembles, Chapter 4) and two-way data (Bayesian Co-clustering, Chapter 5; Residual Bayesian Co-clustering, Chapter 6). Missing value prediction: two-way data (Parametric Probabilistic Matrix Factorization, Chapter 7), multi-way data (Probabilistic Matrix Factorization with Features, Chapter 8; Kernelized Probabilistic Matrix Factorization, Chapter 9; Hierarchical Probabilistic Matrix Factorization, Chapter 10), and three-way data (Probabilistic Tensor Factorization, Chapter 11).

1.2.1 Mixed-membership Models

In Chapter 2, we introduce a family of generative models which allows mixed-membership clustering. In particular, we introduce a family of mixed-membership naive Bayes (MMNB) models [114, 13], effectively by taking the best of both naive Bayes models and mixed-membership topic models. MMNB models are significantly more flexible than NB models by using a Dirichlet-discrete prior, while inheriting NB's advantage in dealing with heterogeneous data. We propose two variational inference algorithms for MMNB. The first is based on the ideas in latent Dirichlet allocation (LDA) [25]. The second uses a substantially smaller number of variational parameters, with no dependency on the dimensionality of the dataset, and an application of the same idea in the context of topic modeling gives a new Fast LDA algorithm. By design, the new algorithm has substantially smaller memory requirements and is orders of magnitude faster.

1.2.2 Discriminative Mixed-membership Models

In Chapter 3, we propose discriminative mixed-membership (DMM) models [114, 117] by combining multi-class logistic regression with unsupervised mixed-membership models. In particular, we consider two variants: discriminative latent Dirichlet allocation (DLDA) and discriminative mixed-membership naive Bayes (DMMNB). DLDA is applicable to text data and uses LDA [25] as the underlying mixed-membership model. DMMNB is applicable to non-text data and uses MMNB as the underlying mixed-membership model. We also derive fast inference algorithms for DMM following the ideas in Chapter 2. In experiments, we show that Fast DMM models achieve higher or competitive performance compared to state-of-the-art classification algorithms.

1.2.3 Bayesian Cluster Ensembles

Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values and distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In Chapter 4, we discuss Bayesian cluster ensembles (BCE) [130, 131], a mixed-membership model for learning cluster ensembles that is applicable to all primary variants of the problem. We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability but also outperforms other algorithms in terms of stability and accuracy.

1.2.4 Bayesian Co-clustering

Co-clustering generates clusters of the rows and columns of a data matrix simultaneously. In Chapter 5, we discuss Bayesian co-clustering (BCC) [112, 115], which views co-clustering as a generative mixture modeling problem. We assume each row and each column to have a mixed membership, from which we generate row and column clusters. Each entry of the data matrix is then generated given that row-column cluster pair. BCC can use any exponential family distribution [12] to generate the entry, which allows

BCC to be applied to a wide variety of data types, such as real, binary, or discrete matrices. Chapter 6 discusses a residual version of BCC, which performs co-clustering on the residual matrix after removing the row and column biases.

1.2.5 Parametric Probabilistic Matrix Factorization

In Chapter 7, we discuss parametric probabilistic matrix factorization (PPMF) [113] for missing value prediction. It is a probabilistic model sitting between probabilistic matrix factorization (PMF) [107] and Bayesian probabilistic matrix factorization (BPMF) [108]. It allows a non-diagonal covariance matrix and hence is more general than PMF, and it is simpler than BPMF since it does not maintain distributions over all covariance matrices. The motivation is to avoid the independence assumption on latent factors in PMF, and to avoid the full Bayesian treatment in BPMF so as to simplify the learning process.

1.2.6 Probabilistic Matrix Factorization with Side Information

In Chapters 8 to 10, we discuss how to incorporate side information into probabilistic matrix factorization [113, 135, 118]. There are three types of side information we discuss in this thesis. The first is feature vectors: we consider movie information, such as cast or genre, for the movies in the movie rating matrix. The second is graphs: we consider the users' social network for the users in the movie rating matrix. The third is hierarchical structures: we focus on the phylogenetic information for the plants in the plant trait matrix. Chapters 8-10 discuss how to incorporate these three types of side information respectively.

1.2.7 Probabilistic Tensor Factorization

In Chapter 11, we discuss probabilistic tensor factorizations for missing value prediction. In particular, we propose two models: parametric probabilistic tensor factorization (PPTF) and Bayesian probabilistic tensor factorization (BPTF) [116]. Both of them are instances of CANDECOMP/PARAFAC (CP) tensor decomposition [70]. PPTF corresponds to PPMF, and BPTF corresponds to BPMF, in the matrix factorization setting. By running experiments on retail sales data, we show that PPTF and BPTF are competitive with the state of the art, and that PPTF is much more efficient than most of the algorithms.

1.3 Main Contributions

The thesis discusses a Bayesian mixed-membership model for clustering and classification on one-way data, including text and numerical data. A fast variational inference algorithm is proposed for learning and inference. The fast variational inference is orders of magnitude faster than the regular variational inference algorithm used in prior work, with competitive clustering and classification performance.

Applying mixed-membership models to cluster ensembles yields Bayesian cluster ensembles, which combine multiple clustering results to generate a consensus clustering. Bayesian cluster ensembles are more versatile than the state of the art, with better or competitive performance.

Applying mixed-membership models to clustering on two-way data yields Bayesian co-clustering, which generates clusters on the rows and columns of a data matrix simultaneously.

Another main contribution of the thesis is probabilistic matrix factorization with side information. Three types of side information are considered: feature vectors, graphs, and hierarchical structures. In particular, the feature vectors are incorporated through topic models, the graphs are incorporated using graph kernels, and the hierarchical structure is incorporated by using upper-level information as the prior for lower-level matrix factorization. The proposed models are able to incorporate the side information effectively for better missing value prediction.

Chapter 2 Mixed-membership Models

For clustering on one-way data, probabilistic mixture models are arguably one of the most popular approaches to latent cluster structure discovery [105, 86, 12]. Naive Bayes (NB) models are a special case of such generative mixture models and have found successful applications in a wide variety of problem domains [97, 39, 96]. In NB models, the probability of a feature vector conditioned on a particular mixture component is assumed to fully factorize over individual features. In spite of their vast popularity, mixture models in general, and NB models in particular, have an important restriction that limits their modeling capabilities: they do not allow each data point to belong to different components with varying degrees, i.e., they do not allow mixed memberships. In a recommendation system scenario, such an assumption may indicate that each user only likes one type of movie. In text mining, such an assumption implies that a document can be on only one topic. In reality, the assumption is clearly not true, and it becomes a restriction on the mixture models' modeling capability. There are a few existing approaches that relax this assumption to mixed membership. However, most of such mixed-membership models only work with a specific type of data [20], such as text or real-valued features, and have not been systematically generalized to deal with arbitrary data types or heterogeneous feature vectors (e.g., users' personal information in online shopping systems, such as age, occupation, monthly expense, etc.), where NB models are still the methods of choice [92, 134]. Meanwhile, for most such mixed-membership models, learning the model through a direct application of the expectation maximization (EM) algorithm [36] is usually intractable.

The two most popular types of approaches to address the problem are variational approximation [62, 25] and Gibbs sampling [53, 56]. Unfortunately, most of these existing algorithms are computationally expensive, which restricts the models' wide application to large datasets in real-life cases.

In this chapter, we introduce a family of generative models which allows mixed-membership clustering, while almost maintaining the simplicity of NB models. In particular, we introduce a family of mixed-membership naive Bayes (MMNB) models, effectively by taking the best of both NB models and mixed-membership topic models such as LDA. MMNB models are significantly more flexible than NB models by using a Dirichlet-discrete prior, while inheriting NB's ability to deal with heterogeneous data. We propose two variational inference algorithms for MMNB, as well as corresponding variational EM algorithms to learn the parameters for any regular exponential family distributions [12, 15]. The first inference algorithm is based on the ideas originally proposed in the context of LDA [25]. The second algorithm uses a substantially smaller number of variational parameters, with no dependency on the dimensionality of the dataset, and an application of the same idea in the context of topic modeling gives a new Fast LDA algorithm. By design, the new algorithm has substantially smaller memory requirements and is orders of magnitude faster, where the speedup roughly increases with the dimensionality of the data, i.e., the higher the dimension of the data, the larger the computational gains.

Figure 2.1: An overview of mixed-membership models: (a) the relationship and structure of the models, with LDA and Fast LDA for text clustering and MMNB and Fast MMNB for non-text clustering; (b) the acronyms of the models (LDA: latent Dirichlet allocation; Fast LDA: fast latent Dirichlet allocation; MMNB: mixed-membership naive Bayes; Fast MMNB: fast mixed-membership naive Bayes; MM: mixed-membership models; Fast MM: fast mixed-membership models).

Figure 2.1(a) gives an overview of the mixed-membership models we will introduce in this chapter, and Figure 2.1(b) gives the acronyms of the models. In particular, LDA and Fast LDA are for text clustering, and MMNB and Fast MMNB are for non-text

clustering. All models are new except LDA. LDA and MMNB together are referred to as mixed-membership (MM) models, and Fast LDA and Fast MMNB together are referred to as fast mixed-membership (Fast MM) models.

The effectiveness of mixed-membership models is established through extensive experiments on various types of datasets. We show that MMNB models outperform NB models in most settings, and the performance of MMNB is found to be very stable across a wide range of parameter choices, especially on held-out test sets. More importantly, the new variational inference algorithm is shown to be orders of magnitude faster than the one used in LDA [25]. In our experiments, we achieve orders of magnitude speedup for Fast LDA and 5-10 times speedup for Fast MMNB without noticeable loss.

Recent years have seen rapid development in mixed-membership models. [61, 25, 56] are mixed-membership topic models for text data. In addition, correlated topic models [22] incorporate the correlation between topics, and dynamic topic models [23] capture the evolution of popular topics over years. There have also been a variety of generalizations of LDA to non-text data [42, 20, 60] by introducing various types of exponential family distributions. For fast inference in mixed-membership models, [122] proposes collapsed variational inference for LDA by combining variational inference with collapsed Gibbs sampling, [103] proposes a fast algorithm for collapsed Gibbs sampling by only checking a subset of topics before drawing a correct sample, and [95] accelerates LDA by doing inference in a distributed way.

2.1 Preliminaries

In this section, we give a brief overview of the existing literature on mixture models, as a preliminary for mixed-membership models.

2.1.1 Finite Mixture Models

Finite mixture (FM) models are arguably the most widely studied and used form of mixture models [105, 12]. An FM model is a convex combination of a finite number of latent component distributions, each of which generates a set of observed data points. To generate a data point $x$, an FM model first picks a component $z = c$ and then generates the data point following the component distribution corresponding to $c$.

If $\pi$ denotes a discrete distribution as a prior over the $k$ components, and $\theta_c$ denotes the parameters of the distribution of the $c$-th component, an FM model with $k$ components has a density function of the form
$$p(x \mid \pi, \Theta) = \sum_{c=1}^{k} p(z = c \mid \pi)\, p(x \mid \theta_c), \qquad (2.1)$$
where $\Theta = \{\theta_c, [c]_1^k\}$ (with $[c]_1^k$ denoting $c = 1, \ldots, k$) are the groups of parameters for the component distributions $\{p(x \mid \theta_c), [c]_1^k\}$. Most of the existing literature has focussed on the case where the component distributions belong to a regular exponential family [12, 15], and the most widely used mixture models are Gaussian mixture models, where each $p(x \mid \theta_c)$ is a Gaussian distribution.

2.1.2 Naive Bayes Models

Naive Bayes (NB) models (Figure 2.2(a)) are a special case of FM models. (In the mixture-model setting, NB is an unsupervised clustering algorithm, as opposed to a supervised classification algorithm.) NB assumes that the features of a data point are conditionally independent given the latent component. In particular, with an appropriate univariate exponential family [12, 15] on feature $j$ and component $c$ given by $p_{\psi_j}(x_j \mid \theta_{jc}) = \exp(x_j \theta_{jc} - \psi_j(\theta_{jc}))\, p_j(x_j)$, the probability of an $M$-dimensional feature vector $x$ given the component $z = c$ is
$$p(x \mid \theta_c) = \prod_{j=1}^{M} p_{\psi_j}(x_j \mid \theta_{jc}) = \prod_{j=1}^{M} \exp(x_j \theta_{jc} - \psi_j(\theta_{jc}))\, p_j(x_j),$$
where $\psi_j(\cdot)$ is the cumulant or log-partition function, and $p_j(x_j)$ is a non-negative base measure. $\psi_j(\cdot)$ determines the exponential family model appropriate for feature $j$, e.g., Gaussian, Poisson, Bernoulli, etc., and $\theta_{jc}$ is the natural parameter corresponding to feature $j$ and component $c$. Given the discrete distribution $\pi$ over the components, the marginal probability of $x$ according to naive Bayes is given by
$$p(x \mid \pi, \Theta) = \sum_{c=1}^{k} p(z = c \mid \pi) \prod_{j=1}^{M} p_{\psi_j}(x_j \mid \theta_{jc}), \qquad (2.2)$$
where $\Theta = \{\theta_{jc}, [c]_1^k, [j]_1^M\}$.
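To make Eq. (2.2) concrete, the following is a minimal sketch, not the thesis implementation, of the naive Bayes mixture density with univariate Gaussians as the per-feature exponential family; all names and toy values are illustrative assumptions.

```python
# A minimal sketch of Eq. (2.2) with Gaussian per-feature distributions.
import numpy as np
from scipy.stats import norm

def nb_mixture_density(x, pi_prior, means, variances):
    """p(x | pi, Theta) = sum_c p(z=c | pi) * prod_j p_psi_j(x_j | theta_jc).

    x         : (M,) feature vector
    pi_prior  : (k,) mixing weights over components
    means     : (M, k) Gaussian mean for feature j under component c
    variances : (M, k) Gaussian variance for feature j under component c
    """
    # per-feature likelihoods under each component: shape (M, k)
    feature_lik = norm.pdf(x[:, None], loc=means, scale=np.sqrt(variances))
    # conditional independence of features given z = c  ->  product over j
    component_lik = feature_lik.prod(axis=0)            # shape (k,)
    return float(np.dot(pi_prior, component_lik))       # mixture over components

# toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=5)
print(nb_mixture_density(x, np.full(3, 1 / 3), rng.normal(size=(5, 3)), np.ones((5, 3))))
```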

Figure 2.2: Graphical model representation of naive Bayes models (a) and latent Dirichlet allocation (b).

2.1.3 Latent Dirichlet Allocation

One key assumption of NB models, or FM models in general, is that the latent component $z$ is fixed across all features of a data point $x$. While such an assumption is reasonable in certain domains, it puts a major restriction on the flexibility of NB models. LDA [25, 56] is an elegant extension of standard mixture models that relaxes this assumption in the context of topic modeling, where each data point is a collection of tokens, e.g., a document with a collection of words. LDA assumes that each word in a document potentially comes from a separate topic $z$, which is generated from a document-specific discrete distribution discrete($\pi$), and all documents share a $k$-dimensional Dirichlet prior $\alpha$. The generative process for each document $x$ is as follows (Figure 2.2(b)):

1. Choose a mixed-membership vector $\pi \sim$ Dirichlet($\alpha$).
2. For each of the $M$ words (tokens) $(x_j, [j]_1^M)$ in $x$:
   (a) Choose a topic (component) $z_j = c \sim$ discrete($\pi$).
   (b) Choose $x_j \sim$ discrete($\beta_c$).

$\beta = \{\beta_c, [c]_1^k\}$ is a collection of parameters for the $k$ component distributions, each of them a $V$-dimensional discrete distribution, where $V$ is the total number of words in the dictionary. LDA assumes that words are generated from topics, and that the topics are exchangeable within a document. Recall that according to de Finetti's representation theorem [34], if the joint distribution of a set of random variables is invariant to permutation, these random variables can be considered independent and identically distributed conditioned on a latent parameter, which is drawn from a certain distribution. In LDA, the random variables in question are the topics corresponding to the words, and the latent parameter is $\pi$ for the discrete distribution, which is drawn from the Dirichlet distribution Dirichlet($\alpha$).
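The generative process above is easy to simulate. The following is a small reader-added sketch of it in Python; alpha, beta, and the document length are illustrative toy values, not parameters from the thesis.

```python
# A minimal sketch of the LDA generative process: one document of M tokens
# drawn from k topics over a V-word vocabulary.
import numpy as np

def generate_lda_document(alpha, beta, M, rng):
    """alpha: (k,) Dirichlet prior; beta: (k, V) per-topic word distributions."""
    pi = rng.dirichlet(alpha)                               # step 1: mixed-membership vector
    topics = rng.choice(len(alpha), size=M, p=pi)           # step 2(a): topic per token
    words = [rng.choice(beta.shape[1], p=beta[z]) for z in topics]  # step 2(b): word per token
    return pi, topics, np.array(words)

rng = np.random.default_rng(1)
k, V = 3, 20
alpha = np.full(k, 0.5)
beta = rng.dirichlet(np.ones(V), size=k)                    # (k, V), rows sum to 1
pi, z, w = generate_lda_document(alpha, beta, M=15, rng=rng)
print(pi, z, w)
```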

The density function of a document $x$ is given by
$$p(x \mid \alpha, \beta) = \int_{\pi} p(\pi \mid \alpha) \prod_{j=1}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p(x_j \mid \beta_c)\, d\pi. \qquad (2.3)$$
Computing the probability of a collection of documents is intractable, and several approximate inference techniques have been proposed to address the problem. The two most popular approaches include variational approximation [62, 25] and Gibbs sampling [53, 56].

2.2 Mixed-membership Naive Bayes Models

In this section, we first take a careful look at the strengths and limitations of NB models and LDA, and then propose MMNB models by taking the best of both worlds. A data point in LDA [25] is a collection of tokens, each of which is assumed to be generated from one of the discrete component distributions. The tokens represent the same type of objects, e.g., in the case of LDA, all tokens are words. The set of distributions remains the same across all tokens. In several applications, there are two important deviations from the above set-up:

1. Each feature may have a measured value, e.g., real, categorical, etc. LDA is not designed to deal with such data since it only works with tokens.

2. Features may be heterogeneous. By heterogeneous, we mean a feature vector containing features of different semantics (e.g., height, weight), different data types (e.g., real, integral), different ranges of values (e.g., [-1,0], [10,100]), etc. Using a homogeneous component distribution, LDA is not directly applicable to such heterogeneous features.

As for NB models, while they have been widely used due to their simplicity, and can handle heterogeneous features with measured values, they also suffer from two important limitations:

1. Most large-scale datasets are sparse, so most feature values will be unknown. For example, in a movie recommendation setting, each user would have rated only a very small fraction of all available movies. NB models have no explicit mechanism to handle missing values.

2. Unlike LDA, NB models are not mixed-membership models, because they assume that all the features in a feature vector come from the same mixture component. Such a mixture-of-unigrams approach [25] yields simplicity, but puts a severe restriction on the modeling power of NB.

To address the first drawback of NB models, we introduce marginal naive Bayes models, where the model itself takes into consideration the sparsity structure of the data points. For an $M$-dimensional feature vector $x$ with only a subset of non-missing features, the density function is given by
$$p(x \mid \pi, \Theta) = \sum_{c=1}^{k} p(z = c \mid \pi) \prod_{j=1,\, \exists x_j}^{M} p_{\psi_j}(x_j \mid \theta_{jc}), \qquad (2.4)$$
where $\exists x_j$ denotes that feature $j$ is observed for $x$, so the product is only over the observed features. Note that the observed feature sets will potentially be different for different $x$. Theoretically, the model simply marginalizes over all possible values of the missing features. Operationally, the model is only built over the features whose values are observed, e.g., the movies that have been rated by a certain user. By focusing only on the observed features, marginal NB can naturally handle sparsity, but it inherits the second problem of NB models, i.e., all features are assumed to be generated from the same component $z$.

Meanwhile, as a mixed-membership model, LDA allows tokens in a data point to be generated from different components. We adopt the same idea from LDA in the context of marginal NB, and propose mixed-membership naive Bayes models. In particular, we allow each observed feature $x_j$ of a data point to potentially come from a separate component $z_j$, which has a Dirichlet-discrete prior on top, as in LDA. Given $z_j$, the generation of each feature still follows marginal NB, which allows MMNB to handle heterogeneous features with various types of measured values, so the two limitations of LDA are conveniently addressed. Overall, as a combination of LDA and marginal NB, MMNB takes the best of both to overcome the limitations of each.
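A hedged sketch of the marginal naive Bayes density in Eq. (2.4) follows: the mixture is built only over the observed features of x, with np.nan marking missing entries. The Gaussian choice and all names are assumptions for illustration.

```python
# Marginal NB density over observed features only (Eq. (2.4), Gaussian case).
import numpy as np
from scipy.stats import norm

def marginal_nb_density(x, pi_prior, means, variances):
    observed = ~np.isnan(x)                                # which features are present
    feat_lik = norm.pdf(x[observed, None],
                        loc=means[observed], scale=np.sqrt(variances[observed]))
    return float(np.dot(pi_prior, feat_lik.prod(axis=0)))  # sum_c pi_c * prod_{j observed}

x = np.array([1.2, np.nan, -0.3, np.nan])                  # sparse data point
pi_prior = np.array([0.6, 0.4])
means = np.zeros((4, 2))
variances = np.ones((4, 2))
print(marginal_nb_density(x, pi_prior, means, variances))
```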

Figure 2.3: Graphical model representation of mixed-membership naive Bayes models.

The graphical model for MMNB is given in Figure 2.3. The generative process for $x$ following MMNB can be described as follows:

1. Choose a mixed-membership vector $\pi \sim$ Dirichlet($\alpha$).
2. For each non-missing feature $x_j$ of $x$:
   (a) Choose a component $z_j = c \sim$ discrete($\pi$).
   (b) Choose a feature value $x_j \sim p_{\psi_j}(x_j \mid \theta_{jc})$, where $\psi_j$ and $\theta_{jc}$ jointly decide an exponential family distribution for feature $j$ and component $c$.

To make the model fully generative, we also need to generate the sparsity structure of the dataset. In principle, we can assume a Bernoulli($\lambda$) for each entry in the dataset. The draws from Bernoulli($\lambda$) determine which features of each data point are missing. Since estimation of $\lambda$ can be done from the observed sparsity structure and, in general, it does not affect the rest of the model, we will ignore this aspect. From the generative model, the density function for $x$ is given by
$$p(x \mid \alpha, \Theta) = \int_{\pi} p(\pi \mid \alpha) \prod_{j=1,\, \exists x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p_{\psi_j}(x_j \mid \theta_{jc})\, d\pi, \qquad (2.5)$$
where $\Theta = \{\theta_{jc}, [j]_1^M, [c]_1^k\}$. The probability of the entire dataset $X$ with $N$ data points, $X = \{x_i, [i]_1^N\}$, is given by
$$p(X \mid \alpha, \Theta) = \prod_{i=1}^{N} \int_{\pi_i} p(\pi_i \mid \alpha) \prod_{j=1,\, \exists x_{ij}}^{M} \sum_{c=1}^{k} p(z_{ij} = c \mid \pi_i)\, p_{\psi_j}(x_{ij} \mid \theta_{jc})\, d\pi_i. \qquad (2.6)$$
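The MMNB generative process above can be simulated directly. The sketch below is an assumption rather than the thesis code; it uses Gaussian per-feature distributions and draws a separate component z_j for every observed feature.

```python
# A minimal sketch of the MMNB-Gaussian generative process.
import numpy as np

def generate_mmnb_point(alpha, means, variances, observed, rng):
    """alpha: (k,); means, variances: (M, k); observed: boolean (M,) mask."""
    M, k = means.shape
    pi = rng.dirichlet(alpha)                      # step 1: mixed-membership vector
    x = np.full(M, np.nan)
    z = np.full(M, -1)
    for j in np.flatnonzero(observed):
        z[j] = rng.choice(k, p=pi)                 # step 2(a): component for feature j
        x[j] = rng.normal(means[j, z[j]], np.sqrt(variances[j, z[j]]))  # step 2(b)
    return x, z, pi

rng = np.random.default_rng(2)
M, k = 6, 3
x, z, pi = generate_mmnb_point(np.ones(k), rng.normal(size=(M, k)),
                               np.ones((M, k)), rng.random(M) < 0.7, rng)
print(x, z)
```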

In LDA, an atomic event is the generation of a token (word) $x_j$ from a discrete component distribution, determined by $z_j$. If there are $k$ components, there are $k$ such discrete distributions, which are fixed for generating all words in the document. In MMNB, an atomic event is the generation of a value $x_j$ for the $j$-th feature from an exponential family distribution $p_{\psi_j}(x_j \mid \theta_{jc})$. If there are $k$ components and $M$ features, the total number of component distributions is $k \times M$, with $k$ distributions for each of the $M$ features. Unlike LDA, the distribution for generating $x_j$ depends not only on $z_j$ but also on which feature is being considered. Therefore, by choosing an appropriate exponential family distribution for each feature, MMNB is able to deal with heterogeneous feature vectors. For a concrete exposition of MMNB models, we will focus on two specific instantiations of such models, based on univariate Gaussian and discrete distributions for each feature in each component. Note that although the two examples use the same family of distributions across all features, MMNB in general allows different features to have different distributions and parameters.

1. MMNB-Gaussian: Such models have a Gaussian distribution for each feature, and hence are applicable to data with real-valued features. Given the model parameters $\alpha$ and $\Omega = \{(\mu_{jc}, \sigma_{jc}^2), [j]_1^M, [c]_1^k\}$, the density function is given by
$$p(x \mid \alpha, \Omega) = \int_{\pi} p(\pi \mid \alpha) \prod_{j=1,\, \exists x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, \frac{1}{\sqrt{2\pi\sigma_{jc}^2}} \exp\!\left( -\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2} \right) d\pi. \qquad (2.7)$$

2. MMNB-Discrete: Such models have a discrete distribution for each feature, and hence are applicable to data with categorical features. Assuming that feature $j$ can take $r_j$ possible values, each feature $j$ and component $c$ has a discrete distribution $\{p_{jc}(r), [r]_1^{r_j}\}$, where $p_{jc}(r) \geq 0$ and $\sum_{r=1}^{r_j} p_{jc}(r) = 1$. Given the model parameters $\alpha$ and $\Omega = \{p_{jc}(r), [r]_1^{r_j}, [j]_1^M, [c]_1^k\}$, the density function is given by
$$p(x \mid \alpha, \Omega) = \int_{\pi} p(\pi \mid \alpha) \prod_{j=1,\, \exists x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p_{jc}(x_j)\, d\pi. \qquad (2.8)$$
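As a rough sanity check of Eq. (2.7), reader-added and not part of the thesis, the integral over pi can be approximated by Monte Carlo averaging over Dirichlet samples; this is only practical for tiny examples.

```python
# Monte Carlo approximation of the MMNB-Gaussian density in Eq. (2.7).
import numpy as np
from scipy.stats import norm

def mmnb_gaussian_density_mc(x, alpha, means, variances, rng, n_samples=2000):
    observed = ~np.isnan(x)
    # per-feature, per-component likelihoods for observed features: (M_obs, k)
    lik = norm.pdf(x[observed, None], means[observed], np.sqrt(variances[observed]))
    pis = rng.dirichlet(alpha, size=n_samples)              # (n_samples, k)
    inner = pis @ lik.T                                      # inner[s, j] = sum_c pi_sc * lik_jc
    return float(inner.prod(axis=1).mean())                  # average of prod_j over samples

rng = np.random.default_rng(3)
x = np.array([0.5, np.nan, 1.0])
print(mmnb_gaussian_density_mc(x, np.ones(2), np.zeros((3, 2)), np.ones((3, 2)), rng))
```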

Figure 2.4: Variational distributions for MM models (a) and Fast MM models (b).

2.3 Inference and Estimation

For a given dataset $X = \{x_1, \ldots, x_N\}$, the learning task in MMNB is to estimate the model parameters $(\alpha^*, \Theta^*)$ such that the likelihood of observing the whole dataset, $p(X \mid \alpha^*, \Theta^*)$, is maximized. A general approach for such a task is to use EM algorithms. However, the likelihood calculation in (2.6) is intractable, implying that a direct application of EM is not feasible. In this section, we propose a variational inference method, which alternates between obtaining a tractable lower bound to the true log-likelihood and choosing the model parameters to maximize the lower bound. To obtain a tractable lower bound, we consider an entire family of parameterized lower bounds with a set of free variational parameters, and pick the best lower bound by optimizing it with respect to the free variational parameters. For the details of the derivations, please refer to Appendix A.1.

2.3.1 Variational Approximation

In most applications of the EM algorithm for mixture modeling, in the E-step one can directly compute the latent variable distribution [94, 9], which is used to calculate the expectation of the likelihood; in the M-step, parameter estimation is done by maximizing the expectation of the complete likelihood, where the expectation is with respect to the latent variable distribution. However, a direct computation of the latent variable distribution $p(\pi, z \mid \alpha, \Theta, x)$ is not possible for MMNB models.

In particular, the latent variable distribution, given by
$$p(\pi, z \mid \alpha, \Theta, x) = \frac{p(\pi \mid \alpha) \prod_{j=1,\, \exists x_j}^{M} p(z_j \mid \pi)\, p_{\psi_j}(x_j \mid \theta_{j z_j})}{\int_{\pi} p(\pi \mid \alpha) \prod_{j=1,\, \exists x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p_{\psi_j}(x_j \mid \theta_{jc})\, d\pi}, \qquad (2.9)$$
has an intractable partition function, which cannot be computed in closed form. Hence, we introduce a tractable family of parameterized distributions $q_1(\pi, z \mid \gamma, \phi)$ as an approximation to $p(\pi, z \mid \alpha, \Theta, x)$, where $(\gamma, \phi)$ are free variational parameters. In particular, following [25], we focus on the family (Figure 2.4(a))
$$q_1(\pi, z \mid \gamma, \phi) = q_1(\pi \mid \gamma) \prod_{j=1,\, \exists x_j}^{M} q_1(z_j \mid \phi_j), \qquad (2.10)$$
where, for each data point, $\gamma$ is a Dirichlet distribution parameter over $\pi$ and $\phi = \{\phi_j, [j]_1^M, \exists x_j\}$ are parameters for discrete distributions over the latent components $z$ for all non-missing features. Following Jensen's inequality [94, 25], we have
$$\log p(x \mid \alpha, \Theta) \geq E_{q_1}[\log p(\pi, z, x \mid \alpha, \Theta)] + H(q_1(\pi, z \mid \gamma, \phi)), \qquad (2.11)$$
where $H(\cdot)$ denotes the Shannon entropy. Note that (2.11) gives a family of lower bounds to the true log-likelihood $\log p(x \mid \alpha, \Theta)$, parameterized by $(\gamma, \phi)$. If we denote the corresponding lower bound for data point $x_i$ by $L(\gamma_i, \phi_i; \alpha, \Theta)$, following (2.11) we have
$$L(\gamma_i, \phi_i; \alpha, \Theta) = E_{q_1}[\log p(\pi_i \mid \alpha)] + E_{q_1}[\log p(z_i \mid \pi_i)] + E_{q_1}[\log p(x_i \mid z_i, \Theta)] - E_{q_1}[\log q_1(\pi_i \mid \gamma_i)] - E_{q_1}[\log q_1(z_i \mid \phi_i)]. \qquad (2.12)$$
The lower bound of the log-likelihood on the whole dataset $X$ is simply the summation of $L(\gamma_i, \phi_i; \alpha, \Theta)$ over all data points $x_i$. The best lower bound can be computed by maximizing each $L(\gamma_i, \phi_i; \alpha, \Theta)$ over the free parameters $(\gamma_i, \phi_i)$. A direct calculation gives the following update equations that iteratively maximize the lower bound:
$$\gamma_{ic} = \alpha_c + \sum_{j=1,\, \exists x_{ij}}^{M} \phi_{ijc}, \qquad (2.13)$$
$$\phi_{ijc} \propto \exp\!\left( \Psi(\gamma_{ic}) - \Psi\!\Big(\sum_{l=1}^{k} \gamma_{il}\Big) \right) p_{\psi_j}(x_{ij} \mid \theta_{jc}), \quad [i]_1^N,\ [j]_1^M,\ [c]_1^k,\ \exists x_{ij}, \qquad (2.14)$$
where $\gamma_{ic}$ is the $c$-th component of the parameter of the variational Dirichlet distribution for the $i$-th data point, $\phi_{ijc}$ is the $c$-th component of the parameter of the variational discrete distribution for the $j$-th feature of the $i$-th data point, and $\Psi$ is the digamma function, i.e., the first derivative of the log Gamma function.
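A hedged sketch of the coordinate-ascent updates (2.13)-(2.14) for a single data point under MMNB-Gaussian is given below. It assumes np.nan marks missing entries; names and the fixed iteration count are illustrative.

```python
# Variational E-step updates (2.13)-(2.14) for one data point, Gaussian features.
import numpy as np
from scipy.special import digamma
from scipy.stats import norm

def e_step_point(x, alpha, means, variances, n_iter=20):
    """x: (M,) with np.nan = missing; alpha: (k,); means, variances: (M, k)."""
    M, k = means.shape
    obs = np.flatnonzero(~np.isnan(x))
    log_lik = norm.logpdf(x[obs, None], means[obs], np.sqrt(variances[obs]))  # (|obs|, k)
    phi = np.full((len(obs), k), 1.0 / k)
    gamma = alpha + len(obs) / k
    for _ in range(n_iter):
        # Eq. (2.14): phi_ijc propto exp(Psi(gamma_c) - Psi(sum_l gamma_l)) * p(x_ij | theta_jc)
        log_phi = digamma(gamma) - digamma(gamma.sum()) + log_lik
        log_phi -= log_phi.max(axis=1, keepdims=True)       # numerical stabilization
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # Eq. (2.13): gamma_c = alpha_c + sum over observed features of phi_ijc
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

# toy usage
gamma, phi = e_step_point(np.array([0.3, np.nan, 1.1]), np.ones(2),
                          np.zeros((3, 2)), np.ones((3, 2)))
print(gamma, phi)
```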

From [12], we know that any regular exponential family distribution $p_\psi(x \mid \theta) = \exp(\langle x, \theta \rangle - \psi(\theta))\, p_0(x)$ can be expressed in terms of the Bregman divergence between $x$ and the expectation parameter $\tau$ as $p_\psi(x \mid \theta) = p_f(x \mid \tau) = \exp(-d_f(x, \tau))\, b_f(x)$, where $f$ is the conjugate of the cumulant function $\psi$ of the family, $b_f(x) = \exp(f(x))\, p_0(x)$, and $d_f(\cdot, \cdot)$ is the Bregman divergence determined by the function $f$. Therefore, (2.14) can be written as
$$\phi_{ijc} \propto \exp\!\left( \Psi(\gamma_{ic}) - \Psi\!\Big(\sum_{l=1}^{k} \gamma_{il}\Big) - d_{f_j}(x_{ij}, \tau_{jc}) \right), \qquad (2.15)$$
where $\tau_{jc}$ is the mean of the $j$-th feature of the $c$-th component. The above equation shows the following: $\phi_{ijc}$ is inversely proportional to the exponential of the Bregman divergence between the $j$-th feature of the $i$-th data point, $x_{ij}$, and the expectation of the $j$-th feature of the $c$-th component, $\tau_{jc}$; i.e., if $x_{ij}$ is far from the mean $\tau_{jc}$, its membership in component $c$ will be small. In fact, $\phi_{ij} = \{\phi_{ijc}, [c]_1^k\}$ gives the mixed membership of $x_{ij}$ in the $k$ components. For a specific model, such as MMNB-Gaussian, the update equation for $\phi_{ijc}$ can be obtained by substituting the corresponding distribution for $p_{\psi_j}(x_{ij} \mid \theta_{jc})$ in (2.14). The form of the updates for $\gamma_{ic}$ is independent of the exponential family being used.
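As a reader-added illustration of (2.15), consider the Gaussian case with unit variance: the Bregman divergence is half the squared error, so (2.15) recovers the Gaussian instance of (2.14) up to factors that do not depend on the component $c$.

```latex
% Reader-added example (not from the thesis): the Gaussian instance of (2.15).
% For a unit-variance Gaussian feature, f(\tau) = \tau^2/2, so
\[
d_{f_j}(x_{ij}, \tau_{jc}) = \tfrac{1}{2}(x_{ij} - \tau_{jc})^2
\quad\Longrightarrow\quad
\phi_{ijc} \;\propto\; \exp\!\Big(\Psi(\gamma_{ic}) - \Psi\big(\textstyle\sum_{l=1}^{k}\gamma_{il}\big)
 - \tfrac{1}{2}(x_{ij} - \mu_{jc})^2\Big).
\]
```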

2.3.2 Parameter Estimation

The goal of parameter estimation is to obtain $(\alpha, \Theta)$ such that $\log p(X \mid \alpha, \Theta)$ is maximized. Since the log-likelihood is intractable, we use the lower bound as a surrogate objective to be maximized. Note that for a fixed value of the variational parameters $(\gamma_i, \phi_i)$, obtained by variational inference for each $x_i$, the lower bound of $\log p(X \mid \alpha, \Theta)$, i.e., $\sum_{i=1}^{N} L(\gamma_i, \phi_i; \alpha, \Theta)$, is a function of the parameters $(\alpha, \Theta)$. Following [105, 12], the parameters $\Theta$ can be estimated in closed form for all exponential family distributions. From the Bregman divergence perspective, let $\tau_{jc}$ be the expectation parameter for the $j$-th feature of the $c$-th component; the estimate of $\tau_{jc}$ is given by
$$\tau_{jc} = \frac{\sum_{i=1,\, \exists x_{ij}}^{N} \phi_{ijc}\, s_{ij}}{\sum_{i=1,\, \exists x_{ij}}^{N} \phi_{ijc}}, \quad [j]_1^M,\ [c]_1^k, \qquad (2.16)$$
where $s_{ij}$ is the sufficient statistic. The natural parameter $\theta_{jc}$ is given by conjugacy as $\theta_{jc} = f_j'(\tau_{jc})$, $[j]_1^M$, $[c]_1^k$, where $f_j(\cdot)$ is the conjugate of the cumulant function $\psi_j$ for each feature. We now give the parameter estimation for the two special cases: MMNB-Gaussian and MMNB-Discrete.

MMNB-Gaussian: For Gaussians, by maximizing the lower bound, the exact update equations for $\mu_{jc}$ and $\sigma_{jc}^2$ can be obtained as
$$\mu_{jc} = \frac{\sum_{i=1,\, \exists x_{ij}}^{N} \phi_{ijc}\, x_{ij}}{\sum_{i=1,\, \exists x_{ij}}^{N} \phi_{ijc}}, \qquad (2.17)$$
$$\sigma_{jc}^2 = \frac{\sum_{i=1,\, \exists x_{ij}}^{N} \phi_{ijc}\, (x_{ij} - \mu_{jc})^2}{\sum_{i=1,\, \exists x_{ij}}^{N} \phi_{ijc}}, \quad [j]_1^M,\ [c]_1^k. \qquad (2.18)$$

MMNB-Discrete: For a discrete distribution $p_{jc}$ over $r = 1, \ldots, r_j$ values for feature $j$, the estimate of $p_{jc}(r)$ is given by
$$p_{jc}(r) \propto \sum_{i=1}^{N} \phi_{ijc}\, \mathbb{1}(x_{ij} = r), \quad [c]_1^k,\ [j]_1^M,\ [r]_1^{r_j}, \qquad (2.19)$$
where $\mathbb{1}(x_{ij} = r)$ is the indicator of observing value $r$ for feature $j$ in observation $x_i$. While such a maximum likelihood (ML) estimate will give the maximizing parameters on an observed training set, there is a possibility of some probability estimates being zero. Such an eventuality does not pose a problem on the training set, but inference on unseen or test data may become problematic. If a feature in the test set takes a value that it has never taken in the entire training set, the model will assign a zero probability to the entire set of test observations. The standard approach to address the problem is to use smoothing, so that none of the estimated parameters is zero. In particular, we use Laplace smoothing, which results from a maximum a posteriori (MAP) estimate [35] assuming a Dirichlet prior over each discrete distribution, so that

$$p_{jc}(r) \propto \sum_{i=1}^{N} \phi_{ijc}\, \mathbb{1}(x_{ij} = r) + \epsilon, \quad [c]_1^k,\ [j]_1^M,\ [r]_1^{r_j}, \qquad (2.20)$$
for some $\epsilon > 0$.

The update of $\alpha$ is independent of the choice of exponential family distribution. Using the Newton-Raphson algorithm [25, 91] with line search, the update equation is given by
$$\alpha_c = \alpha_c - \eta\, \frac{g_c - u}{h_c}, \quad [c]_1^k, \qquad (2.21)$$
where
$$g_c = N\left( \Psi\!\Big(\sum_{l=1}^{k} \alpha_l\Big) - \Psi(\alpha_c) \right) + \sum_{i=1}^{N} \left( \Psi(\gamma_{ic}) - \Psi\!\Big(\sum_{l=1}^{k} \gamma_{il}\Big) \right),$$
$$h_c = -N\, \Psi'(\alpha_c), \qquad w = N\, \Psi'\!\Big(\sum_{l=1}^{k} \alpha_l\Big), \qquad u = \frac{\sum_{l=1}^{k} g_l / h_l}{w^{-1} + \sum_{l=1}^{k} h_l^{-1}}.$$
Since $\alpha$ has the constraint $\alpha_c > 0$, by multiplying the second term of (2.21) by $\eta$ we perform a line search to prevent $\alpha_c$ from going out of the feasible range. At the beginning of each iteration, we set $\eta$ to 1. If the updated $\alpha_c$ falls into the feasible range, the algorithm goes on to the next iteration; otherwise, it reduces $\eta$ by a factor of 0.5 until the updated $\alpha_c$ becomes valid.
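A hedged sketch of the Newton step (2.21) with the halving line search described above: gamma stacks the per-point variational Dirichlet parameters, and the diagonal-plus-rank-one form of the Hessian is assumed from the standard Dirichlet update. Names are illustrative.

```python
# Newton-Raphson update of the Dirichlet parameter alpha (Eq. (2.21)) with line search.
import numpy as np
from scipy.special import digamma, polygamma

def update_alpha(alpha, gamma, n_steps=20):
    """alpha: (k,) current estimate; gamma: (N, k) variational Dirichlet parameters."""
    N, k = gamma.shape
    for _ in range(n_steps):
        g = N * (digamma(alpha.sum()) - digamma(alpha)) \
            + (digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))).sum(axis=0)
        h = -N * polygamma(1, alpha)                    # diagonal part of the Hessian
        w = N * polygamma(1, alpha.sum())
        u = (g / h).sum() / (1.0 / w + (1.0 / h).sum())
        step = (g - u) / h
        eta = 1.0
        while np.any(alpha - eta * step <= 0):          # line search: keep alpha_c > 0
            eta *= 0.5
        alpha = alpha - eta * step
    return alpha

# toy usage
print(update_alpha(np.ones(3), np.abs(np.random.default_rng(5).normal(size=(100, 3))) + 0.1))
```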

2.3.3 Variational EM for MMNB

Based on the variational inference and parameter estimation updates, it is straightforward to construct a variational EM algorithm to estimate $(\alpha, \Theta)$. Starting with an initial guess $(\alpha^{(0)}, \Theta^{(0)})$, the variational EM algorithm alternates between two steps:

1. E-step: Given $(\alpha^{(t-1)}, \Theta^{(t-1)})$, for each data point $x_i$, find the optimal variational parameters
$$(\gamma_i^{(t)}, \phi_i^{(t)}) = \arg\max_{(\gamma_i, \phi_i)} L(\gamma_i, \phi_i; \alpha^{(t-1)}, \Theta^{(t-1)}).$$
$L(\gamma_i^{(t)}, \phi_i^{(t)}; \alpha, \Theta)$ gives a lower bound to $\log p(x_i \mid \alpha, \Theta)$.

2. M-step: An improved estimate of the model parameters $(\alpha, \Theta)$ is obtained by maximizing the aggregate lower bound:
$$(\alpha^{(t)}, \Theta^{(t)}) = \arg\max_{(\alpha, \Theta)} \sum_{i=1}^{N} L(\gamma_i^{(t)}, \phi_i^{(t)}; \alpha, \Theta).$$

After $t$ iterations, the objective function becomes $\sum_{i=1}^{N} L(\gamma_i^{(t)}, \phi_i^{(t)}; \alpha^{(t)}, \Theta^{(t)})$. In the $(t+1)$-th iteration, we have
$$\sum_{i=1}^{N} L(\gamma_i^{(t)}, \phi_i^{(t)}; \alpha^{(t)}, \Theta^{(t)}) \leq \sum_{i=1}^{N} L(\gamma_i^{(t+1)}, \phi_i^{(t+1)}; \alpha^{(t)}, \Theta^{(t)}) \leq \sum_{i=1}^{N} L(\gamma_i^{(t+1)}, \phi_i^{(t+1)}; \alpha^{(t+1)}, \Theta^{(t+1)}).$$
The first inequality holds because in the E-step, $L(\gamma_i, \phi_i; \alpha^{(t)}, \Theta^{(t)})$ achieves its maximum at $(\gamma_i^{(t+1)}, \phi_i^{(t+1)})$. Similarly, the second inequality holds because in the M-step, $\sum_{i=1}^{N} L(\gamma_i^{(t+1)}, \phi_i^{(t+1)}; \alpha, \Theta)$ achieves its maximum at $(\alpha^{(t+1)}, \Theta^{(t+1)})$. Therefore, the objective function is non-decreasing until convergence.
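A schematic variational EM loop, an assumption about structure rather than the thesis code, alternating the E-step of Section 2.3.1 and the M-step of Section 2.3.2 for MMNB-Gaussian; it reuses the e_step_point and update_alpha sketches above, and the small floors on the denominators and variances are added only for numerical safety.

```python
# Schematic variational EM for MMNB-Gaussian; X is (N, M) with np.nan = missing.
import numpy as np

def variational_em(X, k, n_em_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    alpha = np.ones(k)
    means = rng.normal(size=(M, k))
    variances = np.ones((M, k))
    for _ in range(n_em_iters):
        gammas = np.zeros((N, k))
        phis, obs_idx = [], []
        for i in range(N):                          # E-step (Section 2.3.1)
            gammas[i], phi_i = e_step_point(X[i], alpha, means, variances)
            phis.append(phi_i)
            obs_idx.append(np.flatnonzero(~np.isnan(X[i])))
        num = np.zeros((M, k)); den = np.zeros((M, k))
        for i in range(N):                          # M-step, Eq. (2.17)
            num[obs_idx[i]] += phis[i] * X[i, obs_idx[i], None]
            den[obs_idx[i]] += phis[i]
        means = num / np.maximum(den, 1e-12)
        sq = np.zeros((M, k))
        for i in range(N):                          # M-step, Eq. (2.18)
            sq[obs_idx[i]] += phis[i] * (X[i, obs_idx[i], None] - means[obs_idx[i]]) ** 2
        variances = np.maximum(sq / np.maximum(den, 1e-12), 1e-6)
        alpha = update_alpha(alpha, gammas)         # M-step, Eq. (2.21)
    return alpha, means, variances
```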

2.4 Fast Variational Inference

The variational distribution we have used in Section 2.3 exactly follows the idea proposed for LDA [25], where every feature $j$ of data point $x_i$ has a corresponding variational parameter $\phi_{ij}$ for the discrete distribution. In this section, we introduce a different variational distribution with a smaller number of parameters, yielding a much faster variational inference algorithm. MMNB with such a fast variational inference algorithm is referred to as Fast MMNB. We also apply the same idea to LDA and obtain the Fast LDA algorithm. The details of the derivations are presented in Appendices A.2 and A.3 for Fast MMNB and Fast LDA respectively.

2.4.1 Variational Approximation

Given the lower bound to the log-likelihood of each data point as in (2.11) of Section 2.3, the variational distribution we have used is (2.10), where each non-missing feature $j$ of each data point $x_i$ has a separate discrete distribution $\phi_{ij}$. In a full data matrix with $N$ $M$-dimensional data points, the total number of $\phi_{ij}$ would be $N \times M$, which is a huge number for high-dimensional data. Meanwhile, since in the E-step of the EM algorithm the optimization is performed over each variational parameter, a large number of variational parameters leads to a large number of optimizations, significantly slowing the algorithm down. To make the algorithm more efficient, we introduce a new family of variational distributions (Figure 2.4(b)):
$$q_2(\pi, z \mid \phi, \gamma) = q_2(\pi \mid \gamma) \prod_{j=1,\, \exists x_j}^{M} q_2(z_j \mid \phi). \qquad (2.22)$$
Compared to $q_1(\pi, z \mid \gamma, \phi)$ in (2.10), $q_2(\pi, z \mid \gamma, \phi)$ has only one discrete distribution parameter $\phi$ for each data point. The total number of $\phi$s decreases from $N \times M$ in (2.10) to $N$ in (2.22); accordingly, the number of optimizations over $\phi$ also decreases from $N \times M$ to $N$. Such a reduction implies a big saving in both time and space, especially for high-dimensional data with a large $M$.

Assume there are $M_i$ non-missing features for each data point $x_i$. Given the variational distribution in (2.22), we have a set of new lower bounds $L(\gamma_i, \phi_i; \alpha, \Theta)$ for $p(x_i \mid \alpha, \Theta)$, and the best lower bound is obtained by maximizing $L(\gamma_i, \phi_i; \alpha, \Theta)$ with respect to the variational parameters. The update equations for the variational parameters become
$$\gamma_{ic} = \alpha_c + M_i \phi_{ic}, \qquad (2.23)$$
$$\phi_{ic} \propto \exp\!\left( \Psi(\gamma_{ic}) - \Psi\!\Big(\sum_{l=1}^{k} \gamma_{il}\Big) \right) \left( \prod_{j=1,\, \exists x_{ij}}^{M_i} p_{\psi_j}(x_{ij} \mid \theta_{jc}) \right)^{1/M_i}, \quad [i]_1^N,\ [c]_1^k, \qquad (2.24)$$
where $\gamma_{ic}$ and $\phi_{ic}$ are the parameters of the variational Dirichlet and discrete distributions, respectively, for the $c$-th component of $x_i$. Comparing (2.24) to (2.14): in (2.14) we have the term $p_{\psi_j}(x_{ij} \mid \theta_{jc})$ in $\phi_{ijc}$ for each feature $j$ of $x_i$, but in (2.24), since there is only one $\phi_i$ for all features of $x_i$, it contains the geometric mean of $p_{\psi_j}(x_{ij} \mid \theta_{jc})$ over all non-missing features. $\gamma_{ic}$ is again independent of the exponential family being used.
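A hedged sketch of the fast updates (2.23)-(2.24) for one data point: a single k-dimensional phi per data point, built from the geometric mean of the per-feature likelihoods. Gaussian features and np.nan for missing entries are assumptions.

```python
# Fast variational E-step (Eqs. (2.23)-(2.24)) for one data point, Gaussian features.
import numpy as np
from scipy.special import digamma
from scipy.stats import norm

def fast_e_step_point(x, alpha, means, variances, n_iter=20):
    obs = np.flatnonzero(~np.isnan(x))
    M_i = len(obs)
    log_lik = norm.logpdf(x[obs, None], means[obs], np.sqrt(variances[obs]))  # (M_i, k)
    mean_log_lik = log_lik.mean(axis=0)     # (1/M_i) * sum_j log p(x_ij | theta_jc)
    k = means.shape[1]
    phi = np.full(k, 1.0 / k)
    gamma = alpha + M_i * phi               # Eq. (2.23)
    for _ in range(n_iter):
        log_phi = digamma(gamma) - digamma(gamma.sum()) + mean_log_lik   # Eq. (2.24)
        phi = np.exp(log_phi - log_phi.max())
        phi /= phi.sum()
        gamma = alpha + M_i * phi           # Eq. (2.23)
    return gamma, phi
```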

2.4.2 Parameter Estimation

After obtaining the variational parameters, we can obtain a tractable lower bound of the log-likelihood as a function of the model parameters (α, Θ). The estimation for α is the same as in Chapter 2.3, using the Newton-Raphson algorithm with line search, and the estimation for Θ has a closed form for exponential family distributions. From the Bregman divergence perspective, assuming the expectation parameter for the j-th feature of component c is τ_jc, the estimate of τ_jc is given by

  \tau_{jc} = \frac{\sum_{i=1,\, x_{ij}}^{N} \phi_{ic}\, s_{ij}}{\sum_{i=1,\, x_{ij}}^{N} \phi_{ic}},\quad [j]_1^M,\ [c]_1^k,  (2.25)

where s_ij is the sufficient statistic and the natural parameter θ_jc = f_j(τ_jc) by conjugacy, with f_j(·) the conjugate of the cumulant function ψ_j for each feature. For two special cases, MMNB-Gaussian and MMNB-Discrete, the closed-form parameter estimates are given below. Note that (2.25)-(2.28) are mild variants of (2.16)-(2.19), as ϕ_ic does not depend on the feature j.

Fast MMNB-Gaussian: For Gaussians, the update equations for µ_jc and σ²_jc are given by

  \mu_{jc} = \frac{\sum_{i=1,\, x_{ij}}^{N} \phi_{ic} x_{ij}}{\sum_{i=1,\, x_{ij}}^{N} \phi_{ic}},  (2.26)

  \sigma_{jc}^2 = \frac{\sum_{i=1,\, x_{ij}}^{N} \phi_{ic} (x_{ij} - \mu_{jc})^2}{\sum_{i=1,\, x_{ij}}^{N} \phi_{ic}},\quad [c]_1^k,\ [j]_1^M.  (2.27)

Fast MMNB-Discrete: For a discrete distribution p_jc over r = 1, ..., r_j values for feature j, the update equation for p_jc(r) is given by

  p_{jc}(r) \propto \sum_{i=1}^{N} \phi_{ic}\, 1(x_{ij} = r) + \epsilon,\quad [c]_1^k,\ [j]_1^M,\ [r]_1^{r_j},  (2.28)

where 1(x_ij = r) is the indicator of observing value r for feature j in observation x_i. Given the updates for the variational and model parameters, a variational EM algorithm can be constructed similar to that in Chapter 2.3.
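As an illustration of the closed-form M-step, the short Python sketch below (illustrative names; not from the thesis code) computes (2.26)-(2.27) from the per-point ϕ vectors of Fast MMNB-Gaussian, again assuming NaN marks missing entries:

    import numpy as np

    def fast_mmnb_mstep_gaussian(X, phi):
        """M-step sketch of (2.26)-(2.27). X: (N, M) data with NaN for missing
        entries; phi: (N, k) variational parameters from the fast E-step."""
        obs = ~np.isnan(X)                                    # (N, M) observation mask
        Xf = np.where(obs, X, 0.0)
        weights = obs[:, :, None] * phi[:, None, :]           # (N, M, k), zero where missing
        denom = weights.sum(axis=0)                           # (M, k)
        mu = (weights * Xf[:, :, None]).sum(axis=0) / denom   # (2.26)
        sigma2 = (weights * (Xf[:, :, None] - mu[None]) ** 2).sum(axis=0) / denom  # (2.27)
        return mu, sigma2

Because ϕ_ic does not depend on the feature index j, the same (N, k) matrix of responsibilities is reused for every feature, which is exactly the "mild variant" of (2.16)-(2.19) noted above.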

2.4.3 Fast LDA

We apply the same idea as in Fast MMNB to variational inference in LDA [25], yielding Fast LDA. As in Figure 2.2(b), LDA has two model parameters α and β: α is the parameter of the Dirichlet distribution over π, and β is the set of discrete distribution parameters for each of the k components over V words, where V is the size of the dictionary. Following the notation in [25], the v-th word in the dictionary is represented by a V-dimensional vector x such that x^v = 1 and x^u = 0 for u ≠ v, and each document x is represented by M words x = {x_1, x_2, ..., x_M}. We introduce the same variational distribution as for Fast MMNB, as in Figure 2.4(b), i.e., for each document x, we introduce one Dirichlet distribution parameterized by γ and one discrete distribution parameterized by ϕ. In particular, the variational distribution is given by:

  q_2(\pi, z \mid \phi, \gamma) = q_2(\pi \mid \gamma) \prod_{j=1}^{M} q_2(z_j \mid \phi).  (2.29)

The lower bound of the log-likelihood in (2.3) is again obtained from Jensen's inequality as in (2.11). By taking the derivatives of the lower bound with respect to ϕ and γ respectively and setting them to zero, the update equations for the variational parameters of x_i are as follows:

  \gamma_{ic} = \alpha_c + M_i \phi_{ic},  (2.30)

  \phi_{ic} \propto \exp\left( \Psi(\gamma_{ic}) - \Psi\Big(\sum_{l=1}^{k} \gamma_{il}\Big) + \frac{1}{M_i} \sum_{j=1}^{M_i} \sum_{v=1}^{V} x_{ij}^{v} \log\beta_{cv} \right),\quad [i]_1^N,\ [c]_1^k,  (2.31)

where M_i is the number of words in document x_i.

For fixed values of the variational parameters γ and ϕ, maximizing the aggregate lower bound with respect to the model parameters yields the update equations for α and β. In particular, the update equation for α is the same as (2.21), and the update equation for β is given by:

  \beta_{cv} \propto \sum_{i=1}^{N} \sum_{j=1}^{M_i} \phi_{ic}\, x_{ij}^{v},\quad [c]_1^k,\ [v]_1^V.  (2.32)
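The following minimal Python sketch (illustrative, not the thesis implementation) shows how the Fast LDA updates (2.30)-(2.32) look when each document is stored as a vector of word counts:

    import numpy as np
    from scipy.special import digamma

    def fast_lda_estep(counts, alpha, log_beta, n_iter=50):
        """Fast E-step for one document. counts: (V,) word counts;
        log_beta: (k, V) log topic-word probabilities; implements (2.30)-(2.31)."""
        M_i = counts.sum()                     # number of word tokens
        k = len(alpha)
        phi = np.full(k, 1.0 / k)
        gamma = alpha + M_i * phi
        for _ in range(n_iter):
            log_phi = (digamma(gamma) - digamma(gamma.sum())
                       + log_beta @ counts / M_i)              # (2.31)
            phi = np.exp(log_phi - log_phi.max())
            phi /= phi.sum()
            gamma = alpha + M_i * phi                          # (2.30)
        return gamma, phi

    def fast_lda_mstep_beta(counts_matrix, phi_matrix, eps=1e-12):
        """Sketch of (2.32): beta_cv proportional to sum_i phi_ic * n_iv.
        counts_matrix: (N, V); phi_matrix: (N, k)."""
        beta = phi_matrix.T @ counts_matrix + eps               # (k, V)
        return beta / beta.sum(axis=1, keepdims=True)

The single ϕ per document is what distinguishes these updates from the regular LDA inference, where each word token carries its own variational discrete distribution.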

2.5 Experimental Results

In this section, we present experimental results for mixed-membership models. The results include three parts: (1) comparison between MMNB and NB, (2) comparison between Fast MM and MM, and (3) interesting properties of the cluster assignments of MMNB. (In this section, we abuse the terminology by using naive Bayes (NB) models to refer to the standard NB or the marginal NB, as appropriate.)

2.5.1 Datasets

Various datasets with different data types (real, integral, discrete, etc.) and different sparsity structures (full, sparse) are used in our experiments to show the versatility of MMNB and its variants.

Table 2.1: The number of data points, features and classes in each UCI dataset (Ecoli, Glass, Iono, Seg, Segn, Sona, Vow, Wdbc, Wine).

UCI Datasets: Nine datasets from the UCI machine learning repository are used for our experiments. These datasets are represented as real-valued full matrices without missing entries. The numbers of data points, features and classes in each dataset are listed in Table 2.1.

Movielens: Movielens is a movie recommendation dataset created by the Grouplens Research Project. It contains 100,000 ratings for 1682 movies by 943 users represented as a sparse matrix, i.e., there are only 6.30% non-missing entries in the matrix. The ratings range from 1 to 5, with 5 being the best.

Foodmart: Foodmart data comes with Microsoft SQL Server. It contains transaction data for a fictitious retailer. In particular, there are 164,558 sales records for 7803 customers and 1559 products, i.e., there are only 1.35% non-missing entries in the matrix. Each customer record contains the number of each product bought by the customer.

Jester: Jester is a joke rating dataset. The original dataset contains 4.1 million continuous ratings of 100 jokes from 73,421 users. The ratings range from -10 to 10, with 10 the best. We pick 1000 users who rate all 100 jokes and use this full data matrix in

our experiment.

For the experiments on LDA and its variants, we use 5 text datasets:

Nasa: Nasa is a text dataset downloaded from the Aviation Safety Reporting System (ASRS) online database. This database contains aviation safety reports submitted by pilots, controllers and others. The dataset used is a subset of the whole database. It contains 4226 documents about anomalies originating from three sources: flight crew, maintenance, and passengers. The vocabulary size is 604.

Classic3: Classic3 [38] is a well known text dataset. It contains 3893 documents from three different classes, including aeronautics, medicine and information retrieval.

CMU Newsgroup: The CMU Newsgroup is also a benchmark text dataset [75]. The standard CMU Newsgroup dataset contains 19,997 messages, collected from 20 different USENET newsgroups. We use three subsets in our experiments: (1) Diff is a collection of 3000 messages from 3 different newsgroups with 1000 messages per class: alt.atheism, rec.sport.baseball and sci.space. (2) Sim is a collection of 3000 messages from 3 somewhat similar newsgroups with 1000 messages per class: talk.politics.guns, talk.politics.mideast, talk.politics.misc. (3) Same is a collection of 3000 messages from 3 very similar newsgroups with 1000 messages per class: comp.graphics, comp.os.ms-windows, comp.windows.x.

2.5.2 Results for MMNB vs. NB

In this section, we demonstrate the efficacy of MMNB through a comparison with NB on the UCI, Jester, Foodmart and Movielens datasets. We use MMNB-Gaussian for UCI and Jester, MMNB-Poisson for Foodmart, and MMNB-Discrete for Movielens. The results show that MMNB is applicable to different types of data and achieves better performance than NB.

Table 2.2: Perplexity of MMNB and NB on the training and test sets of the UCI datasets, with paired t-test p-values. MMNB has a lower (better) perplexity on most of the datasets. (a) Training set. (b) Test set.

Before we compare MMNB and NB, we must note that the parameters of NB effectively have one fewer degree of freedom than those of MMNB. In particular, the k-dimensional Dirichlet parameter α in MMNB can be any non-negative vector, whereas the discrete distribution π in NB has to be a probability distribution summing to one. In other words, there are k scalars to determine the parameter α, but only k − 1 scalars to determine the parameter π. For a generative model, a larger number of parameters may yield a better performance on the training set, such as a lower perplexity or a higher accuracy, since the model can be as complicated as necessary to fit the training data perfectly well. However, such complicated models typically lose the ability to generalize and lead to over-fitting on the test set. Therefore, in our experiments, we consider the comparison to be fair for the following two reasons: First, MMNB and NB essentially have the same number of parameters, with NB having one fewer degree of freedom on the prior parameter. Second, we compare the performance on both training and test sets. If over-fitting does occur for MMNB, it will lead to poor performance on the test set. Thus the results on the test sets are more interesting and crucial.

We use perplexity as the measure for comparison. The generative models are capable of assigning a log-likelihood log p(x_i) to each observed data point x_i. Based on the log-likelihood scores, we compute the perplexity [61, 25] of the entire dataset X as

  \text{perplexity}(\mathcal{X}) = \exp\left\{ -\frac{\sum_{i=1}^{N} \log p(x_i)}{\sum_{i=1}^{N} M_i} \right\},  (2.33)

where M_i is the number of observed features for x_i and N is the number of data points. In the case of a full matrix such as the UCI data, M_i is the number of features, which is the same for all data points. In the case of a sparse matrix such as Movielens, M_i may be different for different data points. As shown in (2.33), the perplexity is a monotonically decreasing function of the log-likelihood, implying that lower perplexity is better (especially on the test set), since the model can explain the data better.
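As a worked illustration of (2.33), the following short Python snippet (illustrative names, not from the thesis code) computes the perplexity from per-point log-likelihoods and the numbers of observed features:

    import numpy as np

    def perplexity(log_likelihoods, n_observed):
        """Perplexity as in (2.33): exp(-sum_i log p(x_i) / sum_i M_i)."""
        return np.exp(-np.sum(log_likelihoods) / np.sum(n_observed))

    # example: three data points with 5, 3 and 4 observed features
    print(perplexity(np.array([-6.2, -4.1, -5.0]), np.array([5, 3, 4])))

Because the exponent is the negative average log-likelihood per observed entry, a better-fitting model always yields a lower perplexity.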

Unless otherwise specified, we use 10-fold cross-validation with random initializations for the MM models. In a 10-fold cross-validation, we divide the dataset evenly into 10 parts, one of which is picked as the test set, and the remaining 9 parts are used as the training set. The process is repeated 10 times, with each part used once as the test set. We then take the average of the results over the 10 folds on the training set and the test set respectively. For results on the training set, we train the model on the training data by running variational EM as in Chapter 2.3 to obtain the model parameters and variational parameters for calculating the perplexity. For results on the test sets, given the model parameters from the training process, we run the E-step (inference) on the test data to obtain the variational parameters, and then calculate the perplexity.

The average perplexities of MMNB and NB on UCI after 10-fold cross-validation are listed in Table 2.2. The number of clusters we use is the actual number of classes given in the dataset. The p-value is from the paired t-test on whether the difference between MMNB and NB is significant. It is clear that MMNB has a significantly lower perplexity than NB on most datasets, especially on the more important test-set results, indicating that MMNB fits the data better than NB.

We run more comprehensive experiments on Movielens. Given a fixed number of classes (k = 20), Figure 2.5 shows the perplexities of MMNB and NB with ϵ varied from 0.01 to 1, where ϵ is the Laplace smoothing parameter introduced in Chapter 2.3 for the MMNB-Discrete case.

Figure 2.5: Perplexities of NB and MMNB with k = 20 and varying ϵ on Movielens. With a larger smoothing parameter, perplexity increases on the training set and decreases on the test set. (a) NB. (b) MMNB.

The overall trend is as follows: when ϵ increases, the perplexity on the training set increases and the perplexity on the test set decreases. The result is consistent with the Bayesian intuition behind smoothing. In particular, a lower value of the Laplace smoothing parameter implies high confidence in the parameters learnt from the training set. The learnt parameters will surely perform well on the training set itself, but do not necessarily perform well on the test set. On the other hand, a larger value of the smoothing parameter implies a conservative approach, which may have restricted performance on the training set, but will perform reasonably well on the test set, especially if the training set is noisy or sparse. Therefore, we observe the ideal behavior one would expect as an effect of smoothing.

Given a range of values for the number of clusters k and the smoothing parameter ϵ, the overall results for the entire (k, ϵ) range on the training and test sets of Movielens are presented as perplexity surfaces in Figure 2.6.

Figure 2.6: Perplexity surfaces of NB and MMNB over a range of k and ϵ on Movielens. MMNB mostly has a lower perplexity than NB, and a more stable performance on the test set. (To give a better presentation of the results, the x and y axes do not run in the same direction in (a) and (b).)

The key observations are as follows:

1. For the training set results in Figure 2.6(a), the perplexity surface for MMNB is almost always lower than that of NB over the entire range. NB tends to do marginally better than MMNB for a very large k and a very high ϵ.

2. Overall, the smoothing parameter has an adverse effect on the training set performance for both MMNB and NB. Both models tend to perform better on the training set with a larger number of latent classes and a smaller value of the smoothing parameter.

3. For the test set results in Figure 2.6(b), MMNB achieves a lower perplexity than NB for a smaller smoothing parameter. NB performs marginally better than MMNB for high values of the smoothing parameter.

4. The test set performance of MMNB is robust across the entire range of (k, ϵ), which highlights the stability of the model.

5. NB's test set performance for low ϵ values is poor, whereas its training set performance is good, which is a clear indication of over-fitting.

Overall, MMNB demonstrates better performance on the training set and more robust and mostly better performance on the test set. Its stability on the test set across different choices of parameters shows its modeling capability and makes it more suitable for real-life tasks.

2.5.3 Results for Fast MM vs. MM

In this section, we demonstrate the advantage of Fast MM compared to MM in terms of running time and modeling performance measured by perplexity. In addition, for the text datasets, we also generate the word lists for topics. The hypothesis is that Fast MM achieves a similar performance to MM, but is much more computationally efficient. We use the text datasets for comparing Fast LDA and LDA, and use Jester, Movielens and Foodmart for comparing Fast MMNB and MMNB. The number of clusters on the text data is the real number of classes, and the number of clusters on Jester, Movielens and Foodmart is 10. The comparisons of average perplexity and time over 10-fold cross-validation are presented in Figure 2.7 and Table 2.3 respectively. The time shown in

the figures is the sum of two parts: training a model on the training set and applying it to the test set to calculate the perplexity.

Figure 2.7: Perplexity of Fast MM compared to MM. Fast MM achieves similar perplexity to MM. (a) LDA and Fast LDA (on Nasa, Classic3, Diff, Sim, Same). (b) MMNB and Fast MMNB (on Jester, Foodmart, Movielens).

Table 2.3: Running time (seconds) of Fast MM and MM. Fast MM is computationally more efficient than MM. (a) LDA and Fast LDA. (b) MMNB and Fast MMNB.

From the comparison, we observe that Fast MM has a similar perplexity to MM on some datasets, and a mildly higher perplexity on others. The overall performance of the two models is close. As for running time, the results provide supportive evidence that Fast LDA is substantially faster than LDA, and Fast MMNB is 5-10 times faster than MMNB, which is a significant improvement in computational efficiency. According to the derived update equations for each model, the improvement is directly related to the dimensionality of the data. Since fast variational inference uses one ϕ per data point irrespective of the dimensionality, while regular variational inference uses a ϕ for each dimension of the data point, the speedups achieved by fast variational inference are more significant on high-dimensional data. However, the number of iterations of the variational EM algorithm is also an important factor in the running time, and it is not determined by the update equations.

Figure 2.8: Histogram of cluster membership entropy on Glass for MMNB and Fast MMNB. (a) MMNB. (b) Fast MMNB.

We further investigate the cluster assignments of Fast MM. The cluster membership of each data point can be considered as its probability of belonging to different clusters. If we calculate the Shannon entropy of the cluster membership, a high entropy indicates a real mixed-membership assignment, while a low entropy implies almost a sole membership. Figure 2.8 shows the histograms of cluster membership entropy of MMNB and Fast MMNB on Glass, where each bar denotes the number of data points falling into that range of entropy. While most data points from MMNB have a large entropy over different ranges, the data points from Fast MMNB mostly have a small entropy. Such results also hold for LDA and Fast LDA on the text data. This interesting observation indicates that fast variational inference actually generates somewhat sole membership, while regular variational inference generates real mixed membership. One possible reason for the sole membership from fast variational inference is as follows: in the E-step, MMNB iterates through (2.13) and (2.14), while Fast MMNB iterates through (2.23) and (2.24). The expression for γ in (2.13) contains the summation of ϕ_j over all features j. Since each ϕ_j may take different values, in the sense that each ϕ_j may peak at a different component, the summation of the ϕ_j may have several peaks on different components. Accordingly, γ will also have several peaks, leading to a mixed membership over those peaked components. In comparison, the expression for γ in (2.23) has a term M_i ϕ instead, so no matter which component ϕ peaks at, the peak will be greatly enhanced in γ, and such enhancement in γ will further increase the sole-membership nature of ϕ through the term exp(Ψ(γ_ic) − Ψ(Σ_{l=1}^{k} γ_il)) in (2.24). By iterating through γ and ϕ, the accumulated enhancement finally leads to almost a sole membership on the peaked component.
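The cluster membership entropy used above is straightforward to compute; a minimal Python sketch (illustrative, not the thesis code) for an (N, k) membership matrix is:

    import numpy as np

    def membership_entropy(phi, eps=1e-12):
        """Shannon entropy of each row of an (N, k) cluster-membership matrix."""
        p = np.clip(phi, eps, None)
        p = p / p.sum(axis=1, keepdims=True)
        return -(p * np.log(p)).sum(axis=1)

    # a near-sole membership has low entropy; a mixed membership has high entropy
    print(membership_entropy(np.array([[0.98, 0.01, 0.01], [0.40, 0.35, 0.25]])))
    # counts, edges = np.histogram(membership_entropy(phi), bins=10)  # as in Figure 2.8

Histogramming these entropies over all data points gives plots of the kind shown in Figure 2.8.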

Figure 2.9 shows the posterior over the components for one data point in Glass at different stages (beginning, middle, and end) of an E-step of MMNB and Fast MMNB, where each bar c shows the probability of the data point belonging to component c, among 6 components in total.

Figure 2.9: Posterior over 6 components for one data point in Glass at the beginning, middle and end of an E-step in MMNB and Fast MMNB. Both algorithms start from an almost uniform distribution, but MMNB ends up with a bimodal distribution showing a mixed membership over two peaked components, and Fast MMNB ends up with a unimodal distribution showing an almost sole membership on the peaked component. Similar results are observed on other datasets.

We can see that at the beginning, both MMNB and Fast MMNB have an almost uniform posterior distribution. As the algorithm runs, the posterior gradually shows some peaks. MMNB finally gives a bimodal distribution (or a multimodal distribution in other examples), and Fast MMNB gives a unimodal distribution. For our experiments, we used a strict stopping criterion for the E-step, which leads to almost sole membership; an early stopping strategy for Fast MMNB may give a mixed membership. The above argument also applies to LDA and Fast LDA.

We use 5% of the data for initialization, and run Fast LDA and LDA on the whole datasets to get word lists of topics for the text data in Tables 2.4-2.8, where the words are listed with decreasing probability in each topic.
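Extracting such word lists from a fitted topic-word matrix is a one-liner; the following Python sketch (illustrative names, not from the thesis code) lists the most probable words of each topic:

    import numpy as np

    def top_words(beta, vocabulary, n_top=10):
        """Return the n_top most probable words of each topic, as in Tables 2.4-2.8.
        beta: (k, V) topic-word probabilities; vocabulary: list of V words."""
        order = np.argsort(-beta, axis=1)[:, :n_top]
        return [[vocabulary[v] for v in row] for row in order]

Applying this to the β estimated by LDA and by Fast LDA produces the paired lists compared below.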

Table 2.4: Word list for three topics on Nasa. The word lists from LDA and Fast LDA are qualitatively similar. Topic 1 is flight crew, Topic 2 is maintenance, and Topic 3 is passenger.

(a) LDA
  Topic 1: runway, approach, aircraft, departure, altitude, turn, time, air traffic control, flight, tower
  Topic 2: aircraft, maintenance, engine, zzz, flight, minimum equipment list, check, fuel, time, gear
  Topic 3: passenger, flight, attendant, captain, seat, told, asked, back, attendants, aircraft

(b) Fast LDA
  Topic 1: runway, aircraft, approach, flight, departure, time, alt, turn, landing, air traffic control
  Topic 2: aircraft, maintenance, flight, engine, minimum equipment list, zzz, check, time, control, crew
  Topic 3: passenger, flight, attendant, capt, told, seat, asked, aircraft, back, attendants

The observations are as follows. First, both LDA and Fast LDA generate appropriate word lists for the topics. For most of the datasets, we can map each list to the given classes. For example, in the result on Nasa, Topic 1 is flight crew, Topic 2 is maintenance, and Topic 3 is passenger. Second, for Nasa, Classic3, and Diff, the datasets with distinct classes of documents, the topic lists generated by Fast LDA and LDA are very similar, even in the rank of the words within each topic. For Sim, the dataset with somewhat similar classes of documents, Topics 1 and 3 from Fast LDA and LDA are still similar, and there is some difference in Topic 2. The difference is probably because the corresponding class of that topic is talk.politics.misc, so it covers several different aspects, which can be extracted in different ways. Despite the difference, the topics generated by Fast LDA and LDA are both qualitatively reasonable lists. Finally, for Same, the dataset with very similar classes of documents, the difference between Fast LDA and LDA is more distinct. While we can approximately map the three topics from LDA to comp.windows.x, comp.graphics, and comp.os.ms-windows respectively, Fast LDA seems to have comp.graphics in both Topics 2 and 3. Therefore, we believe that LDA performs marginally better than Fast LDA in this case.

Given the above observations, we draw a tentative conclusion about the topic modeling performance of LDA and Fast LDA: if the dataset contains several distinct classes with each document belonging to one, the sole membership generated by Fast LDA is good enough for such datasets, and Fast LDA usually gives very similar topic lists to LDA on such data.

Table 2.5: Word list for three topics on Classic3. The word lists from LDA and Fast LDA are qualitatively similar. Topic 1 is information retrieval, Topic 2 is medicine, and Topic 3 is aeronautics.

(a) LDA
  Topic 1: information, library, system, data, libraries, research, systems, retrieval, science, scientific
  Topic 2: patients, cells, cases, normal, growth, blood, found, treatment, children, cell
  Topic 3: flow, boundary, pressure, layer, number, mach, results, theory, heat, method

(b) Fast LDA
  Topic 1: information, library, system, libraries, data, research, retrieval, systems, science, scientific
  Topic 2: patients, cells, cases, normal, growth, blood, treatment, found, children, cell
  Topic 3: flow, boundary, pressure, layer, number, mach, results, theory, shock, heat

Table 2.6: Word list for three topics on Diff. The word lists from LDA and Fast LDA are qualitatively similar. Topic 1 is alt.atheism, Topic 2 is sci.space, and Topic 3 is rec.sport.baseball.

(a) LDA
  Topic 1: god, people, don, time, good, religion, make, objective, point, evidence
  Topic 2: space, earth, nasa, launch, orbit, system, shuttle, moon, time, mission
  Topic 3: year, game, don, team, baseball, good, time, games, hit, players

(b) Fast LDA
  Topic 1: god, people, don, religion, time, objective, good, moral, make, point
  Topic 2: space, earth, nasa, launch, time, orbit, system, don, shuttle, moon
  Topic 3: year, game, don, team, baseball, good, games, time, hit, players

When the classes of documents become similar, each document tends to have a mixed membership over different topics, and then the sole membership from Fast LDA may not be good enough to extract the topic lists. However, as we can see from the examples on Sim and Same, such degeneration of topic modeling performance only happens when the classes are very similar or almost the same.

2.5.4 Results for Cluster Assignments of MMNB

To obtain a better understanding of MMNB's behavior, we run more experiments on UCI data to study the relationship between the cluster assignments and the modeling performance.

Table 2.7: Word list for three topics on Sim. The word lists from LDA and Fast LDA are qualitatively similar. Topic 1 is talk.politics.guns, Topic 2 is talk.politics.misc, and Topic 3 is talk.politics.mideast.

(a) LDA
  Topic 1: people, gun, don, government, fbi, guns, fire, law, time, batf
  Topic 2: people, don, government, rights, men, make, law, political, gay, free
  Topic 3: people, israel, armenian, turkish, jews, armenians, don, israeli, government, time

(b) Fast LDA
  Topic 1: people, gun, don, fbi, guns, fire, government, koresh, time, law
  Topic 2: people, president, don, government, make, stephanopoulos, states, time, state, health
  Topic 3: people, israel, armenian, turkish, jews, armenians, israeli, jewish, war, armenia

Table 2.8: Word list for three topics on Same. LDA generates better topic lists. Topic 1 is comp.windows.x, Topic 2 is comp.graphics, and Topic 3 is comp.os.ms-windows.

(a) LDA
  Topic 1: lib, expose, event, dpy, xmu, libxmu, twm, undefined, key, mydisplay
  Topic 2: file, image, graphics, program, window, ftp, files, jpeg, data, software
  Topic 3: max, windows, dos, card, file, bhj, win, giz, run, system

(b) Fast LDA
  Topic 1: entry, window, entries, program, file, widget, rules, info, section, build
  Topic 2: bit, card, image, colour, graphics, ati, ultra, images, windows, conference
  Topic 3: max, windows, file, image, dos, program, graphics, files, windows, don

Although each data point in the UCI datasets only belongs to one cluster, the cluster assignments from MMNB still convey interesting information. The cluster membership entropy indicates the degree of mixed membership. From another perspective, it also shows the model's degree of confidence when each data point only belongs to one cluster, as in UCI. A low entropy implies almost a hard clustering, hence the model's high confidence in the cluster assignments. A high entropy implies a mixed-membership assignment to multiple clusters, hence the model's low confidence in the cluster assignments. Therefore, we can study the relationship between MMNB's confidence in its cluster assignments and its modeling performance. In particular, we use the cluster membership entropy to measure the degree of confidence in the cluster assignments, and use the test-set perplexity to measure the model's accuracy.
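The evaluation described in the next paragraph, splitting the test points by entropy and measuring perplexity per split, can be sketched in a few lines of Python (illustrative names; not the thesis implementation):

    import numpy as np

    def perplexity_by_entropy_part(log_lik, n_observed, entropy, n_parts=5):
        """Sort test points by membership entropy, split into n_parts equal groups,
        and compute the perplexity (2.33) of each group, as in Figure 2.10."""
        order = np.argsort(entropy)
        parts = np.array_split(order, n_parts)
        return [np.exp(-log_lik[idx].sum() / n_observed[idx].sum()) for idx in parts]

Each returned value corresponds to one 20% slice of the test set, from lowest to highest entropy.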

Figure 2.10: Perplexities with ascending cluster membership entropy on UCI data (panels (a)-(i): Ecoli, Glass, Iono, Seg, Segn, Sona, Vow, Wdbc, Wine). Perplexities increase with ascending entropy on most of the datasets.

The hypothesis is that the more confident the model is on the test set, the higher accuracy it achieves. The experiment runs as follows: we sort all the data points in the test sets in ascending order of their cluster membership entropy, and divide the test sets evenly into five parts according to the ascending entropy, i.e., the first part contains the first 20% of data points with the lowest entropy, the second part contains the second 20% with the second lowest entropy, and so on. We then calculate the perplexities on these five parts separately and draw a perplexity curve. Figure 2.10 shows the curves as an average over 10-fold cross-validation on the 9 UCI datasets. Interestingly, we can see that the perplexity increases monotonically with ascending cluster membership entropy on almost all datasets. Since higher perplexity indicates lower accuracy, and higher cluster

membership entropy indicates lower confidence, the observation can be rephrased as: the model's accuracy decreases monotonically as its confidence decreases, i.e., the less confident the model is, the worse it performs. Therefore, the hypothesis is verified. It is a useful property that helps us understand the results from MMNB.

2.6 Conclusion

In this chapter, we propose a family of mixed-membership naive Bayes models. Such models extend the popular naive Bayes models to work with sparse observations, by marginalizing over all missing features. In addition, they take advantage of the machinery of hierarchical Bayesian modeling to allow NB models to generate mixed memberships for the data points. [25] had suggested that such an extension would be possible due to the modularity of latent Dirichlet allocation (LDA). We demonstrate how powerful such an extension can be in the context of NB models, while advancing the state of the art on NB as well as LDA. Moreover, the new fast variational inference algorithms ensure the scalability of MMNB models. When applied in the context of topic modeling, the same ideas lead to a substantially more efficient algorithm for LDA. Extensive experiments on a variety of datasets demonstrate that MMNB has better performance than NB in terms of predictive perplexity and stability. Further, Fast MM exhibits a substantial improvement in computational efficiency compared to MM, with no noticeable quantitative or qualitative loss.

Chapter 3

Discriminative Mixed-membership Models

As mixed-membership (MM) models, latent Dirichlet allocation (LDA) and mixed-membership naive Bayes (MMNB) achieve good performance in clustering: they yield high clustering accuracy, and also generate mixed-membership vectors which serve as a succinct and interpretable representation of otherwise large and high-dimensional data points. However, one important restriction of most existing mixed-membership models is that they are unsupervised models and cannot leverage class label information for classification. Meanwhile, most popular classification algorithms, such as support vector machines (SVM) [26] and logistic regression (LR) [99], perform well on classification, but the classifier itself is often hard to interpret. Therefore, an accurate discriminative classification algorithm leveraging mixed-membership models for interpretability is highly desirable.

Supervised latent Dirichlet allocation (SLDA) [24] is such a mixed-membership model which takes response variables into account, but it has two limitations preventing it from being used as a classification algorithm:

1. The response variables in SLDA are univariate real numbers assumed to be generated from a normal linear model, whereas the response variables, i.e., labels, are discrete categories in the classification setting. Although the authors pointed out that the response variables can be of various types obtained from generalized linear models, variational inference is difficult in the general case. While a Taylor expansion is recommended [24] to obtain an approximation of the log-likelihood, such an approach forgoes the lower bound guarantee of variational inference.

2. Like latent Dirichlet allocation (LDA), SLDA is designed for text data as a collection of homogeneous tokens. However, most non-text classification tasks, e.g., the UCI benchmark datasets, have features of heterogeneous types with measured values. SLDA is not designed for such data.

In this chapter, we propose discriminative mixed-membership (DMM) models as a classification algorithm by combining multi-class logistic regression with unsupervised MM models. In particular, we consider two variants: discriminative latent Dirichlet allocation (DLDA) and discriminative mixed-membership naive Bayes (DMMNB). DLDA is applicable to text classification and uses latent Dirichlet allocation (LDA) [25] as the underlying MM model. DMMNB is applicable to non-text classification involving different types (e.g., numerical, categorical) of feature vectors and uses mixed-membership naive Bayes (MMNB) as the underlying MM model. Following the fast variational inference in Chapter 2.4, we also have Fast DMM models, viz., Fast DMMNB and Fast DLDA. In experiments, we show that Fast DMMNB and Fast DLDA achieve higher accuracy than unsupervised MM models, as well as higher or competitive performance compared to state-of-the-art classification algorithms. An overview of the DMM models and the acronyms are given in Figure 3.1(a) and 3.1(b) respectively.

Figure 3.1: An overview of discriminative mixed-membership models. (a) shows the relationship and structure of the models: supervised classification splits into text classification (DLDA, Fast DLDA) and non-text classification (DMMNB, Fast DMMNB), which together constitute DMM and Fast DMM. (b) lists the acronyms:
  DLDA — Discriminative latent Dirichlet allocation
  Fast DLDA — Fast discriminative latent Dirichlet allocation
  DMMNB — Discriminative mixed-membership naive Bayes
  Fast DMMNB — Fast discriminative mixed-membership naive Bayes
  DMM — Discriminative mixed-membership models
  Fast DMM — Fast discriminative mixed-membership models

Recent years have seen the emergence of work on extending mixed membership to supervised learning settings. Supervised LDA (SLDA) [24] combines LDA with a real-valued

response variable. [45] proposes a Bayesian model for natural scene categorization. [48] proposes labeled latent Dirichlet allocation to incorporate functional annotation of known genes to guide gene clustering. [74] proposes DiscLDA, which determines document positions on the topic simplex with the guidance of labels. [89] proposes a Dirichlet-multinomial regression which accommodates different types of metadata, including labels. [129] proposes a correlated labeling model for multi-label classification. [128] extends SLDA for image classification and annotation.

3.1 Discriminative LDA

Assuming there are t classes and k components, the graphical model for DLDA is given in Figure 3.2(a). It is similar to LDA except that it also generates the label y, in addition to the document x, through logistic regression with parameter η = {η_1, ..., η_t}, where each η_s for [s]_1^t (i.e., s = 1, ..., t) is a k-dimensional vector and η_t is a zero vector by default. The generative process for each document x and label y is given as follows:

1. Choose a mixed-membership vector π ~ Dirichlet(α).
2. For each of the M words x_j, [j]_1^M, in the document x,
   (a) Choose a component z_j = c ~ discrete(π).
   (b) Choose a word x_j ~ discrete(β_c).
3. Choose the label from a multi-class logistic regression, y ~ LR(η_1^T z̄, η_2^T z̄, ..., η_t^T z̄).

z̄ is the average of z_1, ..., z_M over all observed words. Note that each z_j in LDA can be represented as a k-dimensional unit vector with only the c-th entry being 1 if it denotes the c-th component, so z̄ is also a k-dimensional vector. LR(η_1^T z̄, η_2^T z̄, ..., η_t^T z̄) denotes a logistic transformation on [η_1^T z̄, η_2^T z̄, ..., η_t^T z̄], which is equivalent to a discrete distribution (p_1, ..., p_{t-1}, 1 − Σ_{s=1}^{t-1} p_s) with

  p_s = \frac{\exp(\eta_s^T \bar{z})}{1 + \sum_{s'=1}^{t-1} \exp(\eta_{s'}^T \bar{z})},\quad [s]_1^{t-1}.

In two-class classification, y is 0 or 1 generated from Bernoulli(1/(1 + exp(−η_1^T z̄))), i.e., there is only one parameter η_1 to be estimated; η_2 is the zero vector by default.
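The logistic transformation above is the standard softmax with the last class as baseline; as a small illustration, the following Python sketch (illustrative names, not the thesis code) computes the class distribution and draws a label given z̄ and η:

    import numpy as np

    def label_distribution(z_bar, eta):
        """Class probabilities of the multi-class logistic regression used in DLDA:
        p_s proportional to exp(eta_s^T z_bar) for s < t, with eta_t fixed to zero.
        eta: (t-1, k) array; z_bar: (k,) average of the word-level indicators z_j."""
        scores = eta @ z_bar                       # (t-1,)
        denom = 1.0 + np.exp(scores).sum()
        return np.append(np.exp(scores) / denom, 1.0 / denom)   # sums to one over t classes

    def sample_label(z_bar, eta, rng=np.random.default_rng()):
        p = label_distribution(z_bar, eta)
        return rng.choice(len(p), p=p)

With t = 2 this reduces to the Bernoulli case described above, with a single parameter vector η_1.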

Figure 3.2: Graphical models for (a) DLDA and (b) DMMNB.

There are two important properties of DLDA, and of DMM models in general: (1) The k-dimensional mixed-membership vector z̄ effectively serves as a low-dimensional representation of the original document. While z̄ in LDA is inferred in an unsupervised way, in DLDA it is obtained from a supervised dimensionality reduction. (2) DLDA allows the number of classes t and the number of components k in the generative model to be different. If k were forced to be equal to t, for problems with a small number of classes, z̄ would be a rather coarse representation of the document. In particular, for two-class problems, z̄ would lie on the 2-simplex, which may not be an informative representation for classification purposes. Decoupling the choice of k from t prevents such pathologies. In principle, we may find a proper k using Dirichlet process mixture models [21].

From the generative model, the density function for (x, y) is given by:

  p(x, y \mid \alpha, \beta, \eta) = \int_{\pi} p(\pi \mid \alpha) \left( \prod_{j=1}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p(x_j \mid \beta_c) \right) p(y \mid \bar{z}, \eta)\, d\pi.  (3.1)

The probability of the entire dataset of N documents and labels (X = {x_i, [i]_1^N}, Y = {y_i, [i]_1^N}) is given by

  p(\mathcal{X}, \mathcal{Y} \mid \alpha, \beta, \eta) = \prod_{i=1}^{N} \int_{\pi_i} p(\pi_i \mid \alpha) \left( \prod_{j=1}^{M_i} \sum_{c=1}^{k} p(z_{ij} = c \mid \pi_i)\, p(x_{ij} \mid \beta_c) \right) p(y_i \mid \bar{z}_i, \eta)\, d\pi_i,  (3.2)

where M_i is the total number of words in document i.

3.2 Discriminative MMNB

Discriminative MMNB is similar to DLDA except that it is applicable to non-text data and keeps a separate distribution for each feature, as in MMNB. Given the graphical

model in Figure 3.2(b), the generative process for a data point x and label y is as follows:

1. Choose a mixed-membership vector π ~ Dirichlet(α).
2. For each non-missing feature j in x,
   (a) Choose a component z_j = c ~ discrete(π).
   (b) Choose a feature value x_j ~ p_{ψ_j}(x_j | θ_jc).
3. Choose the label from a multi-class logistic regression, y ~ LR(η_1^T z̄, η_2^T z̄, ..., η_t^T z̄).

Here ψ_j and θ_jc jointly decide an exponential family distribution for feature j and component c.

In both DLDA and DMMNB, following [24], we have used z̄ (the mean of z over all words/features) as the input to the logistic regression. In principle, any other transformation of z could work, as long as it gives a reasonable representation of the original data point. We choose z̄ for the following two reasons. (1) Optimality: given a set of data points, their best representative is always the mean according to a wide variety of divergence functions [12, 8]. We also notice that η_s^T z̄ = η_s^T E[z] = E[η_s^T z] for [s]_1^t, which means that taking the mean of η_s^T z over the features as the input to the logistic function is equivalent to using η_s^T z̄ as the input. (2) Simplicity: since z is a latent variable, using a more complicated transformation of z, such as a non-linear function, would greatly increase the difficulty of inference and learning.

The density function for (x, y) is given by

  p(x, y \mid \alpha, \Theta, \eta) = \int_{\pi} p(\pi \mid \alpha) \left( \prod_{j=1,\, x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p_{\psi_j}(x_j \mid \theta_{jc}) \right) p(y \mid \bar{z}, \eta)\, d\pi.  (3.3)

The probability of the entire dataset of N data points and labels (X = {x_i, [i]_1^N}, Y = {y_i, [i]_1^N}) is given by

  p(\mathcal{X}, \mathcal{Y} \mid \alpha, \Theta, \eta) = \prod_{i=1}^{N} \int_{\pi_i} p(\pi_i \mid \alpha) \left( \prod_{j=1,\, x_{ij}}^{M} \sum_{c=1}^{k} p(z_{ij} = c \mid \pi_i)\, p_{\psi_j}(x_{ij} \mid \theta_{jc}) \right) p(y_i \mid \bar{z}_i, \eta)\, d\pi_i.  (3.4)
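To make the generative process concrete, the following Python sketch (illustrative; not the thesis implementation) samples one (x, y) pair from DMMNB with Gaussian feature distributions:

    import numpy as np

    def generate_dmmnb_gaussian(alpha, mu, sigma2, eta, rng=np.random.default_rng()):
        """Sample one (x, y) from the DMMNB-Gaussian generative process.
        alpha: (k,) Dirichlet parameter; mu, sigma2: (M, k) per-feature Gaussians;
        eta: (t-1, k) logistic-regression parameters (last class is the baseline)."""
        M, k = mu.shape
        pi = rng.dirichlet(alpha)                       # mixed-membership vector
        z = rng.choice(k, size=M, p=pi)                 # one component per feature
        x = rng.normal(mu[np.arange(M), z], np.sqrt(sigma2[np.arange(M), z]))
        z_bar = np.bincount(z, minlength=k) / M         # average indicator vector
        scores = eta @ z_bar
        p = np.append(np.exp(scores), 1.0) / (1.0 + np.exp(scores).sum())
        y = rng.choice(len(p), p=p)
        return x, y

Missing features would simply be left out of the per-feature loop, which is what the restriction to non-missing features in the generative process above expresses.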

Like MMNB, two special cases of DMMNB are DMMNB-Gaussian and DMMNB-Discrete. The density functions corresponding to (2.7) and (2.8) are given by

  p(x, y \mid \alpha, \Omega, \eta) = \int_{\pi} p(\pi \mid \alpha) \left( \prod_{j=1,\, x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, \frac{1}{\sqrt{2\pi\sigma_{jc}^2}} \exp\!\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right) \right) p(y \mid \bar{z}, \eta)\, d\pi  (3.5)

for DMMNB-Gaussian, and

  p(x, y \mid \alpha, \Omega, \eta) = \int_{\pi} p(\pi \mid \alpha) \left( \prod_{j=1,\, x_j}^{M} \sum_{c=1}^{k} p(z_j = c \mid \pi)\, p_{jc}(x_j) \right) p(y \mid \bar{z}, \eta)\, d\pi  (3.6)

for DMMNB-Discrete.

3.3 Inference and Parameter Estimation

Since DMM models assume a generative process for the labels as well as the data points, instead of using the labels directly to train a classifier, we use both X and Y as samples from the generative process to estimate the parameters of the DMM models such that the likelihood of observing (X, Y) is maximized. In particular, we use variational inference and fast variational inference similar to those in Chapters 2.3 and 2.4. For each data point, to obtain a tractable lower bound to log p(x, y | α, Λ, η), where Λ denotes β for DLDA and Θ for DMMNB, we introduce a variational distribution q = q_1 as in (2.10) or q = q_2 as in (2.22) to approximate the true posterior distribution p(π, z | α, Λ, η) over the latent variables. By a direct application of Jensen's inequality [25], the lower bound to log p(x, y | α, Λ, η) is given by:

  \log p(x, y \mid \alpha, \Lambda, \eta) \ge E_q[\log p(\pi, z, x, y \mid \alpha, \Lambda, \eta)] + H(q(\pi, z)).  (3.7)

Denoting the lower bound for each data point (x_i, y_i) by L(γ_i, ϕ_i; α, Λ, η), we have

  L(\gamma_i, \phi_i; \alpha, \Lambda, \eta) = E_q[\log p(\pi_i \mid \alpha)] + E_q[\log p(z_i \mid \pi_i)] + E_q[\log p(x_i \mid z_i, \Lambda)] - E_q[\log q(\pi_i \mid \gamma_i)] - E_q[\log q(z_i \mid \phi_i)] + E_q[\log p(y_i \mid \bar{z}_i, \eta)].  (3.8)

As in MM models, using q = q_1 yields the regular DMM models and using q = q_2 yields the Fast DMM models. We will use q to denote q_1 or q_2 without differentiation unless otherwise necessary. Given the variational distribution q, the first five terms in (3.8) are exactly the same as in the corresponding MM models. The most difficult part is the last term, which cannot be computed exactly even after introducing the variational distribution q, so further approximation is needed. We give the expression for the last term here; the details of the derivation can be found in Appendix A.4. For DLDA, we have

  E_q[\log p(y_i \mid \bar{z}_i, \eta)] \ge \frac{1}{M_i} \sum_{j=1}^{M_i} \sum_{c=1}^{k} \phi_{ijc} \left( \sum_{s=1}^{t-1} \eta_{sc} y_{is} - \frac{1}{\xi_i} \sum_{s=1}^{t-1} \exp(\eta_{sc}) \right) + \left(1 - \frac{1}{\xi_i} - \log\xi_i\right),  (3.9)

and for DMMNB, we have

  E_q[\log p(y_i \mid \bar{z}_i, \eta)] \ge \frac{1}{M_i} \sum_{j=1,\, x_{ij}}^{M} \sum_{c=1}^{k} \phi_{ijc} \left( \sum_{s=1}^{t-1} \eta_{sc} y_{is} - \frac{1}{\xi_i} \sum_{s=1}^{t-1} \exp(\eta_{sc}) \right) + \left(1 - \frac{1}{\xi_i} - \log\xi_i\right),  (3.10)

where ξ_i > 0 is a newly introduced variational parameter. Also, for both Fast DLDA and Fast DMMNB, we have

  E_q[\log p(y_i \mid \bar{z}_i, \eta)] \ge \sum_{c=1}^{k} \phi_{ic} \left( \sum_{s=1}^{t-1} \eta_{sc} y_{is} - \frac{1}{\xi_i} \sum_{s=1}^{t-1} \exp(\eta_{sc}) \right) + \left(1 - \frac{1}{\xi_i} - \log\xi_i\right).  (3.11)

Maximizing the lower-bound function L(γ_i, ϕ_i, ξ_i; α, Λ, η) with respect to the variational parameters gives the update equations for γ_i, ϕ_i and ξ_i in Table 3.1. The average of ϕ_ij over all existing x_ij of x_i in DMM, or ϕ_i in Fast DMM, gives the posterior of z̄, i.e., the low-dimensional representation of each data point. In Table 3.1, note that the last term in all expressions for ϕ contains y, showing that the low-dimensional representation depends not only on x but also on y, which means that DMM models achieve supervised dimensionality reduction. Removing the last term gives the expression for ϕ in the corresponding unsupervised setting.

The variational parameters (ϕ_i, γ_i, ξ_i) from the inference step give the optimal lower bound to the log-likelihood of (x_i, y_i), and maximizing the aggregate lower bound \sum_{i=1}^{N} L(\phi_i, \gamma_i, \xi_i; \alpha, \Lambda, \eta) over all data points with respect to α, Λ and η respectively

yields the estimated parameters. The estimations of α and Λ are the same as in the corresponding MM models. As for η, we have

  \eta_{sc} = \log \frac{\sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{is}\,\phi_{ijc}/M_i}{\sum_{i=1}^{N} \sum_{j=1}^{M_i} \phi_{ijc}/(M_i \xi_i)},\quad [c]_1^k,\ [s]_1^{t-1}

for DMM models, and

  \eta_{sc} = \log \frac{\sum_{i=1}^{N} y_{is}\,\phi_{ic}}{\sum_{i=1}^{N} \phi_{ic}/\xi_i},\quad [c]_1^k,\ [s]_1^{t-1}

for Fast DMM models.

Table 3.1: Updates for variational parameters in DMM and Fast DMM.

(a) Updates for ϕ:
  DLDA:       \phi_{ijc} \propto \exp\Big( \Psi(\gamma_{ic}) - \Psi\big(\textstyle\sum_{l=1}^{k}\gamma_{il}\big) + \sum_{v=1}^{V} x_{ij}^{v}\log\beta_{cv} + \frac{1}{M_i}\sum_{s=1}^{t-1}\big(\eta_{sc} y_{is} - \exp(\eta_{sc})/\xi_i\big) \Big)
  Fast DLDA:  \phi_{ic} \propto \exp\Big( \Psi(\gamma_{ic}) - \Psi\big(\textstyle\sum_{l=1}^{k}\gamma_{il}\big) + \frac{1}{M_i}\sum_{j=1}^{M_i}\sum_{v=1}^{V} x_{ij}^{v}\log\beta_{cv} + \frac{1}{M_i}\sum_{s=1}^{t-1}\big(\eta_{sc} y_{is} - \exp(\eta_{sc})/\xi_i\big) \Big)
  DMMNB:      \phi_{ijc} \propto \exp\Big( \Psi(\gamma_{ic}) - \Psi\big(\textstyle\sum_{l=1}^{k}\gamma_{il}\big) - \frac{(x_{ij}-\mu_{jc})^2}{2\sigma_{jc}^2} - \log\big(\sqrt{2\pi}\,\sigma_{jc}\big) + \frac{1}{M_i}\sum_{s=1}^{t-1}\big(\eta_{sc} y_{is} - \exp(\eta_{sc})/\xi_i\big) \Big)
  Fast DMMNB: \phi_{ic} \propto \exp\Big( \Psi(\gamma_{ic}) - \Psi\big(\textstyle\sum_{l=1}^{k}\gamma_{il}\big) + \frac{1}{M_i}\sum_{j=1,\,x_{ij}}^{M}\Big(-\frac{(x_{ij}-\mu_{jc})^2}{2\sigma_{jc}^2} - \log\big(\sqrt{2\pi}\,\sigma_{jc}\big)\Big) + \frac{1}{M_i}\sum_{s=1}^{t-1}\big(\eta_{sc} y_{is} - \exp(\eta_{sc})/\xi_i\big) \Big)

(b) Updates for γ:
  DLDA:       \gamma_{ic} = \alpha_c + \sum_{j=1}^{M_i} \phi_{ijc}
  Fast DLDA:  \gamma_{ic} = \alpha_c + M_i \phi_{ic}
  DMMNB:      \gamma_{ic} = \alpha_c + \sum_{j=1,\,x_{ij}}^{M} \phi_{ijc}
  Fast DMMNB: \gamma_{ic} = \alpha_c + M_i \phi_{ic}

(c) Updates for ξ:
  DLDA:       \xi_i = 1 + \frac{1}{M_i}\sum_{s=1}^{t-1}\sum_{c=1}^{k}\sum_{j=1}^{M_i} \phi_{ijc}\exp(\eta_{sc})
  Fast DLDA:  \xi_i = 1 + \sum_{s=1}^{t-1}\sum_{c=1}^{k} \phi_{ic}\exp(\eta_{sc})
  DMMNB:      \xi_i = 1 + \frac{1}{M_i}\sum_{s=1}^{t-1}\sum_{c=1}^{k}\sum_{j=1,\,x_{ij}}^{M} \phi_{ijc}\exp(\eta_{sc})
  Fast DMMNB: \xi_i = 1 + \sum_{s=1}^{t-1}\sum_{c=1}^{k} \phi_{ic}\exp(\eta_{sc})

Given the updates for the variational and model parameters, a variational EM algorithm can be constructed to optimize the lower bound to the log-likelihood function over the variational parameters (ϕ_i, γ_i, ξ_i) in the E-step, and over the model parameters (α, Λ, η) in the M-step, until convergence. The objective function is guaranteed to be non-decreasing.
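As a small illustration of the M-step for η, the following Python sketch (illustrative names; not the thesis code) evaluates the Fast DMM closed form given above from the per-point ϕ and ξ values:

    import numpy as np

    def update_eta_fast_dmm(phi, y_onehot, xi):
        """Closed-form eta update for Fast DMM:
        eta_sc = log( sum_i y_is phi_ic / sum_i phi_ic / xi_i ).
        phi: (N, k); y_onehot: (N, t) one-hot labels (last class dropped); xi: (N,)."""
        num = y_onehot[:, :-1].T @ phi              # (t-1, k)
        den = (phi / xi[:, None]).sum(axis=0)       # (k,)
        return np.log(num / den[None, :])

Plugging this update, together with the ϕ, γ and ξ updates of Table 3.1, into an alternating loop gives the variational EM algorithm described above.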

3.4 Experimental Results

In this section, we present experimental results for discriminative mixed-membership models. The results include three parts: (1) comparing (Fast) DMM to the corresponding (Fast) MM models, (2) comparing Fast DMM with other state-of-the-art classification algorithms such as support vector machines and logistic regression, and (3) presenting the topic lists generated by Fast DLDA. The datasets we use are the same UCI datasets and text datasets as in Chapter 2.5, for (Fast) DMMNB and (Fast) DLDA respectively.

Table 3.2: Accuracy for LDA and DLDA (k = t) on text data (Nasa, Classic3, Diff, Sim, Same). Fast DLDA has a higher accuracy on all datasets.

Table 3.3: Accuracy for MMNB and DMMNB (k = t) on UCI data. Fast DMMNB has a higher accuracy on most of the datasets.

3.4.1 DMM vs. MM

We first compare (Fast) DMM models to the corresponding (Fast) MM models. For initialization, the model parameters are initialized using all data points and their labels in the training set. In particular, we set the number of components k to be the number of classes t; use the mean and standard deviation (for the Gaussian case only) of the data points in each class to initialize Λ; and use N_s/N to initialize each dimension of α, where N_s is the number of data points in class s and N is the total number of data points. For η in DMM, we run a cross-validation by holding out 10% of the training data as a validation set and use the parameters generating the best results on the validation set. In particular, each η_s for [s]_1^{t-1} takes the value r u_s, where u_s is a unit vector with the s-th dimension being 1 and the others 0, and r takes values from 0 to 100 in steps of 10.

Table 3.4: Running time (seconds) of DLDA and Fast DLDA on text data. Fast DLDA is computationally more efficient than DLDA.

In principle, MM models are not used for classification, but given the initialization we have introduced, there is a one-to-one mapping between the components and the classes, hence we can measure the accuracy. The results for DLDA and DMMNB from 10-fold cross-validation are presented in Tables 3.2 and 3.3 respectively. The observations are as follows:

1. On text data, Fast DLDA has a higher accuracy than DLDA. On UCI data, Fast DMMNB generally also has a higher accuracy than DMMNB, with a few exceptions. In Chapter 2.5.3, we have seen that Fast MM achieves a similar performance to MM in clustering, but not quite as good. However, when it comes to classification, Fast DMM has higher accuracy than DMM, making the fast variational inference even more advantageous. One possible reason for Fast DMM's better classification performance is as follows: the generative model for (Fast) DMM generates both the feature vectors (X) and the labels (Y). Since Fast DMM uses a simpler variational distribution to model X, it places more emphasis on modeling the labels Y, which is more important for a classification task.

2. While DMM models are not necessarily better than MM models, Fast DMM models are almost always better than Fast MM models. Overall, Fast DMM models achieve the highest accuracy among the four algorithms. The higher accuracy of Fast DMM demonstrates the effect of logistic regression in accommodating label information for DMM models.

As in the unsupervised case, DMM and Fast DMM models generate mixed membership and sole membership respectively. The accuracy results show that sole membership seems to be more helpful than mixed membership in terms of classification accuracy. The possible reason is that in the (single-label) classification scenario, each data point belongs to only one class, hence the sole membership from Fast DMM would

probably be more appropriate.

Table 3.5: Running time (seconds) of DMMNB and Fast DMMNB on UCI data. Fast DMMNB is computationally more efficient than DMMNB.

We compare the running time between DMM and Fast DMM. The results for DLDA and DMMNB are presented in Tables 3.4 and 3.5 respectively. In Table 3.5, although most of the datasets are small, Fast DMMNB is already faster than DMMNB. Fast DMM's advantage increases when it comes to the larger and higher-dimensional text data in Table 3.4, where Fast DLDA is about 20 to 150 times faster than DLDA, showing Fast DMM models' significant superiority in terms of time efficiency, which is consistent with the results in Chapter 2.5.3. Therefore, Fast DMM models are generally more accurate and substantially faster than DMM models.

3.4.2 Fast DMM vs. Other Classification Algorithms

Since Fast DMM models have better performance than DMM models, in this subsection we use Fast DMM to compare with other classification algorithms. In particular, we compare Fast DMMNB with support vector machines (SVM) [29], logistic regression (LR) and the naive Bayes classifier (NBC) on UCI data (different from Chapter 2.5.2, the naive Bayes used in this subsection is the classifier); and compare Fast DLDA with SVM, NBC, LR and the mixture of von Mises-Fisher (vMF) model [11] on text data. Since DMM is a combination of logistic regression and a mixed-membership model, we also compare the results from DMM to the results of applying MM and logistic regression in two sequential steps.

For Fast DMM models, we run the experiments with an increasing k. In particular, for Fast DMMNB, we use k = (t, t+5, t+10), and for Fast DLDA, we use k = (t, t+15, t+30, t+50, t+100). For initialization of Λ, we use the mean and standard deviation

Table 3.6: Accuracy on text data for Fast DLDA and other classification algorithms (vMF, NBC, LR, SVM). Fast DLDA has higher accuracy on most datasets.

Table 3.7: Accuracy on UCI data for Fast DMMNB and other classification algorithms (NBC, LR, SVM). Fast DMMNB has a higher accuracy, except against SVM.

(for the Gaussian case only) of the training data in the given classes, plus some perturbation if k > t; for α, we set it to be 1/k in each dimension; and for η, we again use cross-validation as in Chapter 3.4.1. For SVM, we use linear and RBF kernels with the same cross-validation strategy, with the penalty parameter and the kernel parameter (for RBF only) taking values from 10^{-5} to 10^{5} in multiplicative steps of 10.

The results for Fast DLDA and Fast DMMNB are presented in Tables 3.6 and 3.7. The top parts of the tables are the results from the generative models, and the bottom parts are the results from discriminative classification algorithms. For SVM, we report the highest accuracy of the linear and RBF kernels with different parameters. We use bold for the best results among the generative models, and bold italic for the best results

Table 3.8: Accuracy on text data from Fast LDA+LR and Fast DLDA with different choices of k, where Fast LDA+LR uses Fast LDA and logistic regression in two steps. Fast DLDA achieves higher accuracy, indicating the advantage of supervised dimensionality reduction.

Table 3.9: Accuracy on UCI data from Fast MMNB+LR and Fast DMMNB with different choices of k, where Fast MMNB+LR uses Fast MMNB and logistic regression in two steps. Fast DMMNB achieves higher accuracy, indicating the advantage of supervised dimensionality reduction.

among all algorithms. Three parts of information can be read from the tables:

1. Overall, on the text datasets, Fast DLDA does better than all other algorithms, including SVM, on almost all datasets, which is a promising result, although more rigorous experiments may be needed for further investigation; on the UCI datasets, Fast DMMNB also achieves higher accuracy than all other algorithms on most datasets except SVM, which beats Fast DMMNB five out of nine times.

2. The better performance of Fast DMM models compared to LR on the original datasets indicates that the low-dimensional representation from Fast DMM helps the classification.

Table 3.10: Extracted topics from the Nasa dataset using Fast DLDA.
  Topic 1: runway, aircraft, approach, tower, cleared, landing, airport, turn, taxi, traffic, final, controller
  Topic 2: maintenance, aircraft, flight, minimum equipment list, time, check, engine, mechanical, installed, part, inspection, work
  Topic 3: passenger, flight, attendant, told, captain, seat, asked, back, attendants, aircraft, lavatory, crew
  Topic 4: passenger, flight, medical, attendant, emergency, aircraft, doctor, landing, attendants, captain, oxygen, paramedics

3. Interestingly, for Fast DMMNB, the accuracy increases monotonically with k from t to t+10 on most of the datasets. For Fast DLDA on text data, an increase of accuracy with a larger k is also observed, although the result goes up and down without a clear trend. One possible reason for the increasing accuracy is as follows: when k is too small, we are performing a drastic dimensionality reduction to represent each data point by a k-dimensional mixed-membership vector, which may cause a large loss of information, but the loss may decrease as k increases.

Fast DMM models do dimensionality reduction and classification in one step via a combination of Fast MM and logistic regression. In principle, we may also use these two algorithms sequentially in two steps, i.e., first using Fast MM models to get a low-dimensional representation, and then applying logistic regression on the low-dimensional representation for classification. The results with different choices of k are presented in Tables 3.8 and 3.9 for text and UCI data respectively. It is clear that Fast DMM models outperform the Fast MM&LR strategy. In addition, we have used the labels to initialize Fast MM; without the labels for initialization, the performance of Fast MM&LR would be even worse. In conclusion, by combining Fast MM and logistic regression, Fast DMM achieves supervised dimensionality reduction and obtains a better low-dimensional representation than Fast MM, which helps classification.
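For reference, the two-step baseline compared in Tables 3.8-3.9 can be sketched in a few lines of Python (illustrative; it assumes that the k-dimensional representations phi_train and phi_test have already been produced by an unsupervised Fast MM model, and it uses scikit-learn's logistic regression for the second step):

    from sklearn.linear_model import LogisticRegression

    def two_step_baseline(phi_train, y_train, phi_test):
        """Fast MM + LR baseline: fit a separate logistic regression on the
        unsupervised k-dimensional mixed-membership representations."""
        clf = LogisticRegression(max_iter=1000).fit(phi_train, y_train)
        return clf.predict(phi_test)

In contrast, Fast DMM couples the two stages, so the representation itself is shaped by the labels during training.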

3.4.3 Topics from Fast DLDA

As we have mentioned, DMM models generate interpretable results. We give an example of several topic word lists on Nasa generated by Fast DLDA (k = t + 30) in Table 3.10. It is also an interesting result demonstrating the effect of allowing a larger number of components than the number of classes (k > t): Fast DLDA may discover topics which are not explicitly specified in the class labels, while maintaining the predefined number of classes. The first three topics in Table 3.10 correspond to the three classes in Nasa respectively, but Topic 4, which we call passenger medical emergency, can be considered a subcategory of the passenger class, and it is not specified in the labels. Neither NBC nor SVM is able to generate this type of result.

3.5 Conclusion

In this chapter, we have discussed discriminative mixed-membership models, a combination of unsupervised mixed-membership models and multi-class logistic regression. In particular, we have DLDA for text data and DMMNB for non-text data. An important property of DMM models is that they allow the number of components k to be different from the number of classes t. Interestingly, a larger k helps to discover components not specified in the labels and to increase classification accuracy. Applying the fast variational inference algorithm to DMM yields Fast DMM models, which are significantly faster and also mostly more accurate than DMM. In addition, Fast DLDA and Fast DMMNB are competitive with state-of-the-art classification algorithms in terms of accuracy, especially on text data, and are able to generate interpretable results.

Chapter 4

Bayesian Cluster Ensembles

For clustering on one-way data, cluster ensembles provide a framework for combining multiple base clusterings into a single consolidated clustering without accessing the features of the data or the base clustering algorithms. Compared to individual clustering algorithms, cluster ensembles generate more robust and stable clustering results [125]. In principle, cluster ensembles can leverage distributed computing by calculating the base clusterings in an entirely distributed manner [121]. In addition, since cluster ensembles only need access to the base clustering results instead of the data itself, they provide a convenient approach to privacy preservation and knowledge reuse [121]. Such desirable aspects have made the study of cluster ensembles increasingly important in the context of data mining.

In addition to generating a consensus clustering from a complete set of base clusterings, it is highly desirable for cluster ensemble algorithms to have several additional properties suitable for real life applications. First, there may be missing values in the base clusterings. For example, in a customer segmentation application, while there will be legacy clusterings on old customers, there will be no results on the new customers. Cluster ensemble algorithms should be able to build consensus clusters with such missing information on base clusterings. Second, there may be restrictions on bringing all the base clusterings to one place to run the cluster ensemble algorithm. Such restrictions may arise because the base clusterings are held by different organizations and considered private information that cannot be shared. Cluster ensemble algorithms should be able to work with such column-distributed base clusterings.

Third, the data objects themselves may be distributed over multiple locations; while it is possible to get a base clustering across the entire dataset by message passing, base clusterings for different parts of the data will be in different locations, and there may be restrictions on bringing them together at one place. For example, in a customer segmentation application, different vendors may have different subsets of customers, and a base clustering on all the customers can be performed using privacy preserving clustering algorithms; however, the cluster assignments of the customer subsets for each vendor are private information which the vendors will be unwilling to share directly for the purpose of forming a consensus clustering. Again, it is desirable to have cluster ensemble algorithms handle such row-distributed base clusterings.

Current cluster ensemble algorithms, such as the cluster-based similarity partitioning algorithm (CSPA) [121], the hypergraph partitioning algorithm (HGPA) [121], or k-means based algorithms [73], are able to accomplish one or two of the above variants of the problem. However, none of them was designed to address all of the variants. In principle, the recently proposed mixture modeling approach to learning cluster ensembles [125] is applicable to the variants, but the details have not been reported in the literature. In this chapter, we propose Bayesian cluster ensembles (BCE), which solve the basic cluster ensemble problem using a Bayesian approach, i.e., by effectively maintaining a distribution over all possible consensus clusterings, and which seamlessly generalize to all the important variants discussed above. Similar to the mixture modeling approach, BCE treats all base clustering results for each object as a feature vector with discrete feature values, and learns a mixed-membership model from such a feature representation. Extensive empirical evaluation demonstrates that BCE is not only versatile in terms of its applicability, but also mostly outperforms the other cluster ensemble algorithms in terms of stability and accuracy.

There have been three main classes of cluster ensemble algorithms. The most popular algorithms for cluster ensembles are graph-based models [121, 47, 6]. The main idea of this class of algorithms is to convert the results of the base clusterings to a hypergraph or a graph and then use graph partitioning algorithms to obtain ensemble clusters. The second class of algorithms [50, 46, 78] first converts the base clustering results into a co-association or similarity matrix between the data points, and then performs clustering based on the matrix. The third class of cluster ensemble algorithms [125] treats the base

clustering results as a new feature vector for the data points and performs clustering using the new feature vectors. The Bayesian cluster ensemble proposed in this chapter falls into the third category, but instead of using a mixture model as in [125], we use the mixed-membership naive Bayes (MMNB) model from Chapter 2. Therefore, BCE can also be considered as an application of MMNB.

4.1 Problem Definition

Given $N$ data points $\mathcal{O} = \{o_i, [i]_1^N\}$ (where $[i]_1^N$ denotes $i = 1, \ldots, N$) and $M$ base clustering algorithms $\mathcal{C} = \{c_j, [j]_1^M\}$, we get $M$ base clusterings of the data points, one from each algorithm. The only requirement on a base clustering algorithm is that it generates a cluster assignment or id for each of the $N$ data points $\{o_i, [i]_1^N\}$. The number of clusters generated by different base clustering algorithms may be different. We denote the number of clusters generated by $c_j$ by $k_j$, so that the cluster ids assigned by $c_j$ range from $1$ to $k_j$. If $\lambda_{ij} \in \{1, \ldots, k_j\}$ denotes the cluster id assigned to $o_i$ by $c_j$, the base clustering algorithm $c_j$ gives a clustering of the entire dataset, given by $\lambda_j = \{\lambda_{ij}, [i]_1^N\} = \{c_j(o_i), [i]_1^N\}$. The results from the $M$ base clustering algorithms can be stacked together to form an $(N \times M)$ matrix $B$, whose $j$th column is $\lambda_j$, as shown in Figure 4.1(a). The matrix can be viewed from another perspective: each row $x_i$ of the matrix, i.e., all base clustering results for $o_i$, gives a new vector representation of the data point $o_i$ (Figure 4.1(b)). In particular, $x_i = \{x_{ij}, [j]_1^M\} = \{c_j(o_i), [j]_1^M\}$.

Given the base clustering matrix $B$, the cluster ensemble problem is to combine the $M$ base clustering results for the $N$ data points to generate a consensus clustering, which should be more accurate, robust, and stable than the individual base clusterings. The traditional approach to processing the base clustering results is column-wise (Figure 4.1(a)), i.e., we consider $B$ as a set of $M$ columns of base clustering results $\{\lambda_j, [j]_1^M\}$, and we try to find the consensus clustering $\lambda$. The disadvantage of the column-wise perspective is that it either needs to find the correspondence

between different base clusters generated by different algorithms or to construct a graph before proceeding [121]. Cluster correspondence problems are difficult to solve, and constructing graphs increases the complexity.

Figure 4.1: Two ways of processing base clustering results for cluster ensembles.

A simpler approach to the cluster ensemble problem, which is what we use in this chapter, is to read the matrix $B$ in a row-wise manner (Figure 4.1(b)). All base clustering results for a data point $o_i$ can be considered as a vector $x_i$ with discrete values on each dimension [125], and we consider the base clustering matrix $B$ as a set of $N$ rows of $M$-dimensional vectors $\{x_i, [i]_1^N\}$. From this perspective, the cluster ensemble problem becomes finding a clustering $\lambda$ for $\{x_i, [i]_1^N\}$, where $\lambda$ is a consensus clustering over all base clusterings. Further, by considering the cluster ensemble problem from this perspective, we naturally avoid the cluster correspondence problem, because for each $x_i$, $\lambda_1$ and $\lambda_2$ are just two features; they are conditionally independent in the naive Bayes setting for clustering.

While the basic cluster ensemble framework assumes that all base clustering results for all data points are available in one place to perform the analysis, real life applications often need variants of the basic setting. In this chapter, we discuss three important variants: missing value cluster ensembles, row-distributed cluster ensembles, and column-distributed cluster ensembles.

4.1.1 Missing Value Cluster Ensembles

When several base clustering results are missing for several data points, we have a missing value cluster ensemble problem. Such a problem can appear for various reasons.

For example, if new data points are added to the dataset after running clustering algorithm $c_j$, these new data points will not have base clustering results corresponding to $c_j$. In missing value cluster ensembles, instead of dealing with a full base clustering matrix $B$, we are dealing with a matrix with missing entries.

4.1.2 Row-distributed Cluster Ensembles

For row-distributed cluster ensembles, base clustering results of different data points (rows) are at different locations. The corresponding real life scenario is that different subsets of the original dataset are owned by different organizations, or cannot be put together in one place due to size, communication, or privacy constraints. While distributed base clustering algorithms, such as distributed privacy preserving k-means [63], can be run on the subsets to generate base clustering results, due to the restrictions on sharing, the results on different subsets cannot be transmitted to a central location for analysis. Meanwhile, combining the results on different subsets helps to generate a more reasonable ensemble clustering. Therefore, it is desirable to learn a consensus clustering in a row-distributed manner.

4.1.3 Column-distributed Cluster Ensembles

For column-distributed cluster ensembles, different base clustering results of all data points are at different locations. The corresponding real life scenario is that separate organizations have different base clusterings on the same set of data points, e.g., different e-commerce vendors having customer segmentations on the same customer base. The base clusterings cannot be shared with others due to privacy concerns, but each organization has an incentive to get a more robust consensus clustering. In such a case, the cluster ensemble problem has to be solved in a column-distributed way.

4.2 Bayesian Cluster Ensembles

In this section, we propose the Bayesian cluster ensemble model. The main idea of BCE is to use the base clustering results $B$ as the input to the mixed-membership naive Bayes model of Chapter 2, i.e., each $x_i$ is a data point, and the result of MMNB gives the consensus clustering. In particular, given a base clustering matrix $B = \{x_i, [i]_1^N\}$ for $N$ data

points, we assume there exists a Bayesian graphical model generating $B$. In particular, we assume that each object $x_i$ has an underlying mixed membership to different consensus clusters. Let $\pi_i$ denote the latent mixed-membership vector for $x_i$; if there are $k$ consensus clusters, $\pi_i$ is a discrete distribution over the $k$ clusters. From the generative model perspective, we assume that $\pi_i$ is sampled from a Dirichlet distribution with parameter $\alpha$. Further, each latent consensus cluster $c$, $[c]_1^k$, has a discrete distribution $\theta_{cj}$ over the cluster ids $\{1, \ldots, k_j\}$ for base clustering result $\lambda_j$. The full generative process for each $x_i$ is assumed to be as follows:

1. Choose $\pi_i \sim \text{Dirichlet}(\alpha)$.
2. For the $j$th base clustering of $x_i$, if it is non-missing:
   (a) Choose a component $z_{ij} = c \sim \text{discrete}(\pi_i)$;
   (b) Choose the base clustering result $x_{ij} \sim \text{discrete}(\theta_{cj})$.

Thus, the model contains the model parameters $(\alpha, \Theta)$, where $\Theta = \{\theta_{cj}, [c]_1^k, [j]_1^M\}$, the latent variables $\{\pi_i, z_{ij}, [i]_1^N, [j]_1^M\}$, and the actual observations $B = \{x_{ij}, [i]_1^N, [j]_1^M\}$. BCE can be viewed as a special case of the mixed-membership naive Bayes model [114] obtained by choosing a discrete distribution as the generative model. Further, BCE is closely related to LDA [25], although the models are applicable to different types of data.

Given the model parameters $\alpha$ and $\Theta$, the joint distribution of the latent and observed variables $\{x_i, z_i, \pi_i\}$ is given by:
$$p(x_i, \pi_i, z_i \mid \alpha, \Theta) = p(\pi_i \mid \alpha) \prod_{j=1, \exists x_{ij}}^{M} p(z_{ij} \mid \pi_i)\, p(x_{ij} \mid \theta_{z_{ij} j}),$$
where $\exists x_{ij}$ denotes that there exists a $j$th base clustering result for $x_i$, so the product is only over the existing base clustering results. By integrating over the latent variables $\{z_i, \pi_i\}$, the marginal probability of each $x_i$ is given by:
$$p(x_i \mid \alpha, \Theta) = \int_{\pi_i} p(\pi_i \mid \alpha) \prod_{j=1, \exists x_{ij}}^{M} \sum_{c} p(z_{ij} = c \mid \pi_i)\, p(x_{ij} \mid \theta_{cj})\, d\pi_i. \qquad (4.1)$$
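To make the generative process above concrete, the following minimal sketch (an assumed NumPy implementation, not the dissertation's code) samples a base clustering matrix $B$ from the BCE model with $k$ consensus clusters; base_sizes[j] plays the role of $k_j$ and Theta[c][j] of $\theta_{cj}$, with cluster ids stored 0-indexed.

```python
import numpy as np

def sample_bce(N, M, k, base_sizes, alpha, Theta, seed=0):
    """Sample an (N, M) base-clustering matrix from the BCE generative process.
    alpha: length-k Dirichlet parameter; Theta[c][j]: distribution over the
    k_j possible ids of base clustering j for consensus cluster c."""
    rng = np.random.default_rng(seed)
    B = np.empty((N, M), dtype=int)
    for i in range(N):
        pi_i = rng.dirichlet(alpha)                  # mixed membership of object i
        for j in range(M):
            z = rng.choice(k, p=pi_i)                # consensus cluster for entry (i, j)
            B[i, j] = rng.choice(base_sizes[j], p=Theta[z][j])  # observed base-cluster id
    return B
```

Missing entries can be modeled by simply skipping selected (i, j) pairs, matching step 2 of the generative process.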

4.3 Variational Inference for BCE

We have assumed a generative process for the base clustering matrix $B = \{x_i, [i]_1^N\}$ in Chapter 4.2. Given the observable matrix $B$, our final goal is to infer the mixed membership $\{\pi_i, [i]_1^N\}$ of each object to the consensus clusters. However, from MMNB in Chapter 2.3, we know that it is intractable to do parameter estimation and inference directly for BCE. Therefore, we use variational inference as in Chapter 2.3. In particular, we introduce a family of variational distributions
$$q(\pi_i, z_i \mid \gamma_i, \phi_i) = q(\pi_i \mid \gamma_i) \prod_{j=1, \exists x_{ij}}^{M} q(z_{ij} \mid \phi_{ij}) \qquad (4.2)$$
as an approximation of the true posterior
$$p(\pi_i, z_i \mid x_i, \alpha, \Theta) = \frac{p(\pi_i, z_i, x_i \mid \alpha, \Theta)}{p(x_i \mid \alpha, \Theta)}, \qquad (4.3)$$
where $\gamma_i$ is a Dirichlet distribution parameter, and $\phi_i = \{\phi_{ij}, [j]_1^M\}$ are discrete distribution parameters. Following Chapter 2.3, the update equations for the variational parameters $\phi$ and $\gamma$ are given by
$$\phi_{ijc} \propto \exp\left( \Psi(\gamma_{ic}) - \Psi\Big(\sum_{c'=1}^{k} \gamma_{ic'}\Big) + \sum_{r=1}^{k_j} 1(x_{ij} = r) \log \theta_{cj}(r) \right) \qquad (4.4)$$
$$\gamma_{ic} = \alpha_c + \sum_{j=1, \exists x_{ij}}^{M} \phi_{ijc}, \qquad [i]_1^N, [j]_1^M, [c]_1^k, \qquad (4.5)$$
where $\phi_{ijc}$ is the $c$th component of the variational discrete distribution $\phi_{ij}$ for $z_{ij}$, $\gamma_{ic}$ is the $c$th component of the variational Dirichlet distribution $\gamma_i$ for $\pi_i$, and $1(x_{ij} = r)$ is an indicator of whether $x_{ij} = r$. The updates for the model parameters $\alpha$ and $\Theta$ are the same as in Chapter 2.3. In particular, $\Theta$ is updated following
$$\theta_{cj}(r) \propto \sum_{i=1}^{N} \phi_{ijc}\, 1(x_{ij} = r), \qquad [c]_1^k, [j]_1^M, [r]_1^{k_j}. \qquad (4.6)$$
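A minimal sketch of one variational EM sweep implementing (4.4)-(4.6) is given below, assuming simple NumPy data layouts (B as an integer matrix with 0-indexed base-cluster ids, a 0/1 mask for non-missing entries, and $\log\theta$ stored per consensus cluster and base clustering); the Newton-Raphson update for $\alpha$, equation (4.7) below, is omitted here. This is only an illustration of the updates, not the dissertation's implementation.

```python
import numpy as np
from scipy.special import digamma

def bce_em_sweep(B, mask, alpha, log_theta, gamma, n_inner=20, eps=1e-12):
    """One variational EM sweep for BCE.
    B: (N, M) int matrix of base-cluster ids; mask: (N, M) 0/1 for non-missing.
    alpha: (k,) Dirichlet parameter; gamma: (N, k) variational Dirichlet parameters.
    log_theta[c][j]: log-distribution over the k_j ids of base clustering j."""
    N, M = B.shape
    k = len(alpha)
    phi = np.zeros((N, M, k))
    for _ in range(n_inner):                                   # E-step: iterate (4.4) and (4.5)
        E_log_pi = digamma(gamma) - digamma(gamma.sum(1, keepdims=True))
        for i in range(N):
            for j in range(M):
                if not mask[i, j]:
                    continue
                s = E_log_pi[i] + np.array([log_theta[c][j][B[i, j]] for c in range(k)])
                s -= s.max()                                   # eq. (4.4), up to normalization
                phi[i, j] = np.exp(s) / np.exp(s).sum()
        gamma = alpha + (phi * mask[:, :, None]).sum(axis=1)   # eq. (4.5)
    # M-step for Theta, eq. (4.6): theta_cj(r) proportional to sum_i phi_ijc 1(x_ij = r)
    new_log_theta = []
    for c in range(k):
        per_j = []
        for j in range(M):
            counts = np.full(len(log_theta[c][j]), eps)
            for i in range(N):
                if mask[i, j]:
                    counts[B[i, j]] += phi[i, j, c]
            per_j.append(np.log(counts / counts.sum()))
        new_log_theta.append(per_j)
    return phi, gamma, new_log_theta
```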

The Dirichlet parameter $\alpha$ can be estimated via Newton-Raphson updates as in (2.21):
$$\alpha_c = \alpha_c - \eta\, \frac{g_c - u}{h_c}, \qquad (4.7)$$
with
$$g_c = N\left(\Psi\Big(\sum_{c'=1}^{k} \alpha_{c'}\Big) - \Psi(\alpha_c)\right) + \sum_{i=1}^{N}\left(\Psi(\gamma_{ic}) - \Psi\Big(\sum_{c'=1}^{k} \gamma_{ic'}\Big)\right),$$
$$h_c = -N\,\Psi'(\alpha_c), \qquad u = \frac{\sum_{c'=1}^{k} g_{c'}/h_{c'}}{w^{-1} + \sum_{c'=1}^{k} h_{c'}^{-1}}, \qquad w = N\,\Psi'\Big(\sum_{c'=1}^{k} \alpha_{c'}\Big),$$
where $\Psi$ is the digamma function, i.e., the first derivative of the log Gamma function, and $\Psi'$ is its derivative.

To run the variational EM algorithm, we update $\phi$ and $\gamma$ following (4.4) and (4.5) in the E-step and update $\Theta$ and $\alpha$ in the M-step. We run the E-step and M-step iteratively until convergence. For computational complexity, since we only need to calculate $\Psi(\gamma_{ic}) - \Psi(\sum_{c'=1}^{k} \gamma_{ic'})$ once for all $\phi_{ijc}, [j]_1^M$, the complexity for updating $\phi$ in each E-step is $O((Nk^2 + NMk\bar{k})t_E)$, where $\bar{k} = \max\{k_j, [j]_1^M\}$ and $t_E$ is the number of iterations inside each E-step. Also, the time for updating $\gamma$ is $O(NMkt_E)$. In the M-step, the complexity for updating $\Theta$ is $O(NMk\bar{k})$. $\alpha$ is updated using Newton updates, and the time needed is $O(kNt_\alpha)$, where $t_\alpha$ is the number of iterations in the Newton updates. Compared to the mixture model based cluster ensemble algorithm [125], BCE is computationally more expensive, since it has iterations over (4.4) and (4.5) inside the E-step, while [125] uses a direct EM algorithm whose E-step is in closed form. However, as we show in the experiments, BCE achieves significantly better performance than mixture models.

4.3.1 Row-distributed EM Algorithm

In row-distributed EM for cluster ensembles, the object set $\mathcal{O}$ is partitioned into $P$ parts $\{\mathcal{O}^{(1)}, \mathcal{O}^{(2)}, \ldots, \mathcal{O}^{(P)}\}$ and different parts are assumed to be at different locations. We further assume that a set of distributed base clustering algorithms have been used to

obtain the base clustering results $\{B^{(1)}, B^{(2)}, \ldots, B^{(P)}\}$. Now, we outline a row-distributed variant of the variational inference algorithm. At each iteration $t$, given the initialization of the model parameters $(\alpha^{(t-1)}, \Theta^{(t-1)})$, row-distributed variational EM for BCE proceeds as follows:

1. For each partition $\{B^{(p)}, [p]_1^P\}$, we obtain variational parameters $(\phi^{(p)}, \gamma^{(p)})$ following (4.4) and (4.5), where $\phi^{(p)} = \{\phi_i \mid x_i \in B^{(p)}\}$ and $\gamma^{(p)} = \{\gamma_i \mid x_i \in B^{(p)}\}$.

2. To update $\Theta$ following (4.6), we can write the right-hand side of (4.6) as $\sum_{x_i \in B^{(1)}} \phi_{ijc} 1(x_{ij} = r) + \cdots + \sum_{x_i \in B^{(P)}} \phi_{ijc} 1(x_{ij} = r)$. Each part of the summation corresponds to one partition of $B$. To update $\theta_{cj}(r)$, first, the partial sum $\Delta^{(p)} = \sum_{x_i \in B^{(p)}} \phi_{ijc} 1(x_{ij} = r)$ is calculated for each $B^{(p)}$. Second, for each $B^{(p)}$ ($p \in [2, P]$), we take $\sum_{q=1}^{p-1} \Delta^{(q)}$ from $B^{(p-1)}$, generate $\sum_{q=1}^{p} \Delta^{(q)}$ by adding $\Delta^{(p)}$ to the summation, and pass it to $B^{(p+1)}$. Finally, after passing through all partitions, we have the summation as the right-hand side of (4.6), which updates $\theta_{cj}(r)$ after normalization.

3. Updating $\alpha$ is a little tricky since it does not have a closed form solution. However, we notice that the update equation (4.7) for $\alpha_c$ only depends on two quantities: $\alpha_c$ and $\{\gamma_i, [i]_1^N\}$. $\alpha_c$ can be obtained from the last iteration of the Newton-Raphson algorithm. Regarding $\gamma$, we only need to know $\sum_{i=1}^{N} \Psi(\gamma_{ic})$ and $\sum_{i=1}^{N} \Psi(\sum_{c'} \gamma_{ic'})$ for calculating $g$ in (4.7). We use the same strategy as for updating $\Theta$: first, we calculate $\Lambda_p = \sum_{x_i \in B^{(p)}} \Psi(\gamma_{ic})$ and $\Omega_p = \sum_{x_i \in B^{(p)}} \Psi(\sum_{c'} \gamma_{ic'})$ on each partition. Second, for each $B^{(p)}$ ($p \in [2, P]$), we take $\sum_{q=1}^{p-1} \Lambda_q$ and $\sum_{q=1}^{p-1} \Omega_q$ from $B^{(p-1)}$, generate $\sum_{q=1}^{p} \Lambda_q$ and $\sum_{q=1}^{p} \Omega_q$ by adding $\Lambda_p$ and $\Omega_p$ to the summations respectively, and pass them to $B^{(p+1)}$. Finally, after going through all partitions, we have the result for $\sum_{i=1}^{N} \big(\Psi(\gamma_{ic}) - \Psi(\sum_{c'} \gamma_{ic'})\big)$, so we can update $\alpha_c$ following (4.7). For each iteration of the Newton-Raphson algorithm, we need to pass the summations through all partitions once.

By the end of the $t$th iteration, we have the updated model parameters $(\alpha^{(t)}, \Theta^{(t)})$, which are used as the initialization for the $(t+1)$th iteration. The algorithm is guaranteed to converge since it is essentially the same as the EM algorithm for the general case, except that

it works in a row-distributed way. By running EM distributedly, neither $\{\mathcal{O}^{(p)}, [p]_1^P\}$ nor $\{B^{(p)}, [p]_1^P\}$ is passed around between different parties, but only the intermediate summations; in this sense, we achieve privacy preservation.

As we have noted, updating $\alpha$ is very expensive because the summations need to be passed over all partitions for each Newton-Raphson iteration, which is practically infeasible for a dataset with a large number of partitions. Therefore, we next give a heuristic row-distributed EM, which does not have a theoretical guarantee of convergence, but worked well in practice in our experiments. At each iteration $t$, given the initialization of the model parameters $(\alpha^{(t-1)}_{(1)}, \Theta^{(t-1)}_{(1)})$, heuristic row-distributed variational EM for BCE proceeds as follows:

1. For the first partition $B^{(1)}$, given $(\alpha^{(t-1)}_{(1)}, \Theta^{(t-1)}_{(1)})$, we obtain variational parameters $(\phi^{(1)}, \gamma^{(1)})$ following (4.4) and (4.5). Also, we update $(\alpha_{(1)}, \Theta_{(1)})$ to get $(\alpha^{(t)}_{(1)}, \Theta^{(t)}_{(1)})$ following (4.7) and (4.6) respectively.

2. For the $p$th partition $B^{(p)}$, we initialize $(\alpha_{(p)}, \Theta_{(p)})$ with $(\alpha^{(t)}_{(p-1)}, \Theta^{(t)}_{(p-1)})$ and obtain $(\phi^{(p)}, \gamma^{(p)})$ following (4.4) and (4.5). We update $(\alpha^{(t)}_{(p)}, \Theta^{(t)}_{(p)})$ and pass them to the $(p+1)$th partition.

After going over all partitions, we are done with the $t$th iteration; the iterations are repeated until convergence. The initialization for $(\alpha^{(1)}_{(1)}, \Theta^{(1)}_{(1)})$ in the first iteration can be picked at random or by using some heuristics, and the initialization for $(\alpha_{(1)}, \Theta_{(1)})$ in the $t$th iteration comes from $(\alpha^{(t-1)}_{(P)}, \Theta^{(t-1)}_{(P)})$. The iterations are run until the net change in the lower bound value is below a threshold, or until a pre-fixed number of iterations is reached.

4.3.2 Column-distributed EM Algorithm

For column-distributed cluster ensembles, we design a client-server style algorithm, where each client maintains one base clustering, and the server gathers partial results from the clients and performs further processing. While we assume that there are $M$ different clients, one can always work with a smaller number of clients by splitting the columns among the available clients. Given the initialization for the model parameters $(\alpha^{(t)}, \Theta^{(t)})$, where $(\alpha^{(t)}, \Theta^{(t)}_j)$ is made available to the $j$th client, the column-distributed cluster ensemble at iteration $t$ proceeds as follows:

1. E-step, $j$th client: Given $x_{ij}$ and $\theta^{(t)}_{cj}$ for $[i]_1^N, [c]_1^k$, the $j$th client calculates $\sum_{r=1}^{k_j} 1(x_{ij} = r) \log \theta^{(t)}_{cj}(r)$ for $[i]_1^N, [c]_1^k$ and passes the results to the E-step server.

2. E-step, server: Given $\sum_{r=1}^{k_j} 1(x_{ij} = r) \log \theta^{(t)}_{cj}(r)$ from the clients, for $[i]_1^N, [j]_1^M, [c]_1^k$, the server calculates the variational parameters $\{\phi_{ijc}, [i]_1^N, [j]_1^M, [c]_1^k\}$ following (4.4). Given $\alpha^{(t)}$ and $\{\phi_{ijc}, [i]_1^N, [j]_1^M, [c]_1^k\}$, the server updates $\{\gamma_{ic}, [i]_1^N, [c]_1^k\}$ following (4.5). The parameters $\{\phi_{ijc}, [i]_1^N, [c]_1^k\}$ are passed to the M-step $j$th client and $\{\gamma_{ic}, [i]_1^N, [c]_1^k\}$ are passed to the M-step server.

3. M-step, $j$th client: Given $x_{ij}$ and $\phi_{ijc}$ for $[i]_1^N, [c]_1^k$, $\theta^{(t+1)}_{cj}(r)$ is updated following (4.6) for $[c]_1^k$ and $[r]_1^{k_j}$, and passed to the E-step server for the $(t+1)$th iteration.

4. M-step, server: Given $\alpha^{(t)}$ and $\gamma_{ic}$ for $[i]_1^N, [c]_1^k$, $\alpha^{(t+1)}$ is updated following (4.7) and passed to the E-step server for the next iteration.

The initialization $(\alpha^{(0)}, \Theta^{(0)})$ is chosen at the beginning of the first iteration. In iteration $t$, $(\alpha^{(t)}, \Theta^{(t)})$ are initialized by $(\alpha^{(t-1)}, \Theta^{(t-1)})$, i.e., the results of the $(t-1)$th iteration. The algorithm is guaranteed to converge because it is essentially the same as the EM algorithm for general cluster ensembles except that it runs in a column-distributed way. The algorithm is expected to be more efficient than the general cluster ensemble if we ignore the communication overhead. In addition, the $j$th client only has access to the $j$th base clustering results. The communication involves only the parameters and intermediate results, instead of the base clusterings themselves. Therefore, privacy preservation is also achieved.

In BCE, the most computationally expensive part of the E-step is the update of $\phi$. By running column-distributed EM, we parallelize most of the computation in updating $\phi$; the time complexity of updating $\phi$ in each E-step hence decreases from $O((Nk^2 + NMk\bar{k})t_E)$ to $O((Nk^2 + Nk\bar{k})t_E)$, where $\bar{k} = \max\{k_j, [j]_1^M\}$. In the M-step, the update of $\Theta$ decreases from $O(NMk\bar{k})$ to $O(Nk\bar{k})$ through parallelization.

4.4 Experimental Results

In this section, we present experimental results for Bayesian cluster ensembles. We compare BCE to several state-of-the-art cluster ensemble algorithms, and show its versatility

by presenting experimental results under different experimental settings, such as the missing value, row-distributed, and column-distributed cases.

Table 4.1: The number of data points, features, and classes in each data set (datasets: pima, iris, wdbc, balance, glass, bupa, wine, magic, ionosphere, segmentation, kdd99, chess, wine quality).

We use data sets from the UCI machine learning repository and KDD Cup 1999. In particular, for UCI data, we pick 12 datasets (for wine quality we only keep the data points in the 3 main classes, so the classes with very few data points are removed). For the KDD Cup data, there are four main classes among 37 classes in total. We randomly pick 1,000,000 data points from these four main classes. The number of data points, features, and classes in each data set are listed in Table 4.1, where kdd99 is from KDD Cup 1999 and the rest are from the UCI machine learning repository.

In our experiments, there are two steps leading to the final consensus clustering. First, we run base clustering algorithms to get a set of base clustering results. Second, various cluster ensemble algorithms, including the mixture model (MM) [125], CSPA, HGPA, MCLA [121], and k-means, are applied to the base clustering results to generate a consensus clustering. We compare their results with BCE. The comparison between BCE and the other cluster ensemble algorithms is divided into five categories as follows:

1. General cluster ensemble (general).
2. Cluster ensemble with missing values (miss-v).
3. Cluster ensemble with increasing number of columns (increase-c), i.e., additional

base clusterings.
4. Column-distributed cluster ensemble (column-d).
5. Row-distributed cluster ensemble (row-d).

Table 4.2: The applicability of the algorithms (k-means, CSPA, HGPA, MCLA, MM, BCE) to the different experimental settings (general, miss-v, increase-c, column-d, row-d).

Table 4.2 shows the five categories of experiments and the six algorithms we use. We can see that most of the algorithms can only accomplish a few of the five tasks. In principle, MM can be generalized to deal with all five scenarios; however, the literature does not have an explicit algorithm for column-distributed or row-distributed cluster ensembles using MM. As we can see from Table 4.2, BCE is the most flexible and versatile among the six algorithms.

For evaluation, we use micro-precision [136] to measure the accuracy of the consensus clustering with respect to the true labels. The micro-precision is defined as
$$MP = \sum_{c=1}^{k} a_c / n, \qquad (4.8)$$
where $k$ is the number of clusters, $n$ is the number of objects, and $a_c$ denotes the number of objects in consensus cluster $c$ that are correctly assigned to the corresponding class. We identify the corresponding class for consensus cluster $c$ as the true class with the largest overlap with the cluster, and assign all objects in cluster $c$ to that class. Note that $0 \leq MP \leq 1$, with $1$ indicating the best possible consensus clustering, which has to be in full agreement with the class labels. In the following subsections, we present the experimental results for the five categories of experiments in Table 4.2, starting from general cluster ensembles.
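The micro-precision in (4.8) is straightforward to compute from a consensus clustering and the true labels; a minimal sketch (assuming 0-indexed integer class labels) follows.

```python
import numpy as np

def micro_precision(consensus, labels):
    """MP = sum_c a_c / n, where a_c is the size of the largest overlap of
    consensus cluster c with any true class; labels are 0-indexed integers."""
    consensus, labels = np.asarray(consensus), np.asarray(labels)
    correct = 0
    for c in np.unique(consensus):
        members = labels[consensus == c]
        correct += np.bincount(members).max()   # a_c for cluster c
    return correct / len(labels)
```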

Table 4.3: Cluster ensemble results using k-means with different initializations as the base clustering algorithm. (a) and (b) report the maximum and average MP for the base clusterings (k-means) and the cluster ensemble algorithms (MCLA, CSPA, HGPA, MM, k-means, BCE) on iris, wdbc, ionosphere, glass, bupa, pima, wine, magic04, balance, segmentation, and kdd99 (magic04 and kdd99 are too large, so CSPA could not finish its run). (c) reports the paired t-test for MM and BCE, where Mean-D is the mean of the MP differences obtained by (MM - BCE), and sd-MM (sd-BCE) is the standard deviation of the MPs from MM (BCE).

Table 4.4: Cluster ensemble results using k-means, FCM, AP, and METIS as the base clustering algorithms. (a) and (b) report the maximum and average MP for the base clusterings (k-means, FCM, AP, METIS) and the cluster ensemble algorithms (MCLA, CSPA, HGPA, MM, BCE) on iris, wdbc, ionosphere, glass, bupa, pima, wine, balance, segmentation, chess, and wine quality. (c) reports the paired t-test for MM and BCE, where Mean-D is the mean of the MP differences obtained by (MM - BCE), and sd-MM (sd-BCE) is the standard deviation of the MPs from MM (BCE).

4.4.1 General Cluster Ensembles

In this subsection, we run two types of experiments: one uses only k-means as the base clustering algorithm, and the other uses multiple algorithms as the base clustering algorithms. Given $N$ objects, we first use k-means as the base clustering algorithm on 11 datasets. For the 10 UCI datasets, we run k-means 2000 times with different initializations to obtain 2000 base clustering results, which are divided evenly into 100 sets, with 20 base clustering results in each of them. For kdd99, we run the experiments following the same strategy, but we keep 3 sets with 5 base clustering results in each of them. Cluster ensemble algorithms are then applied on each subset. The maximum and average MPs over all subsets are reported in Tables 4.3(a) and 4.3(b).

We also use k-means, fuzzy c-means (FCM) [19], METIS [65], and affinity propagation (AP) [51] as the base clustering algorithms on 11 datasets for cluster ensembles (we run this set of experiments on relatively small datasets since METIS and AP cannot run on large data sets such as kdd99). By running k-means 500 times, FCM 800 times, METIS 200 times, and AP 500 times with different initializations, we also obtain 2000 base clustering results. Following the same strategy as above to run the cluster ensemble algorithms, we report the maximum and average MPs in Tables 4.4(a) and 4.4(b).

The key observations from Tables 4.3 and 4.4 can be summarized as follows: (1) BCE almost always has a higher maximum and average MP than the base clustering results, which means the consensus clustering from BCE is indeed better in quality than the original base clusterings. (2) BCE outperforms the other cluster ensemble algorithms most of the time in terms of both maximum and average MP, no matter which base clustering algorithms are used.

Since the results of MM and BCE are rather close to each other, to make a careful comparison, we run a paired t-test under the hypotheses
$$H_0: MP(\text{MM}) = MP(\text{BCE}) \qquad \text{vs.} \qquad H_a: MP(\text{MM}) < MP(\text{BCE}).$$
The test is designed to assess the strength of the evidence against $H_0$ and supporting

$H_a$. Such strength is measured by the p-value; a lower p-value indicates stronger evidence. In our case, a lower p-value indicates that the performance improvement of BCE over MM is statistically significant. Usually a p-value of less than 0.05 is considered strong evidence. The results are shown in Tables 4.3(c) and 4.4(c) respectively. BCE outperforms MM with a low p-value (< 0.05) most of the time, indicating that MP(BCE) is significantly better than MP(MM) on these datasets. In addition, the smaller standard deviation of BCE shows that it is more stable than MM.

Figure 4.2: Average MP with increasing percentage of missing values (HGPA, CSPA, MCLA, MM, and BCE on Iris, Wdbc, Ionosphere, Pima, Glass, Bupa, Wine, Balance, and Segmentation).

4.4.2 Cluster Ensembles with Missing Values

Given 20 base clustering results for $N$ objects, we randomly hold out $p$ percent of the data as missing values, with $p$ increasing from 0 to 90 in steps of 4.5. We compare the performance of the different algorithms except k-means, because k-means cannot handle missing values. Each time, we run the algorithms 10 times and report the MP on 9 datasets in Figure 4.2. Surprisingly, before the missing value percentage reaches 70%, most algorithms have a stable MP with an increasing number of missing entries, without a

distinct decrease in accuracy. BCE is always among the top one or two in terms of accuracy across different percentages of missing values, indicating that BCE is one of the best algorithms for dealing with missing value cluster ensembles. Comparatively, HGPA seems to have the worst performance in terms of both accuracy and stability.

Figure 4.3: Average MP comparison with increasing number of available base clusterings (HGPA, CSPA, MCLA, MM, and BCE on Iris, Wdbc, Ionosphere, Pima, Glass, Bupa, Wine, Balance, and Segmentation).

4.4.3 Cluster Ensembles with Increasing Columns

In order to find out how increasing the number of base clusterings affects the cluster ensemble accuracy, we perform experiments for cluster ensembles with the number of columns (base clusterings) increasing from 1 to 20 in steps of 1. We first generate 20 base clusterings as a pool. At each step $s$, we randomly pick $s$ base clusterings from the pool, which is repeated 50 times to generate 50 $(N \times s)$ base clustering matrices (note that there are repetitions among these 50 matrices). We then run the cluster ensemble algorithms on each of them. The average MP over the 50 runs at each step is reported in Figure 4.3 for 9 datasets.

First, we can see that BCE is again among the top one or two on all the data sets in our experiments. Second, the MPs for most of the algorithms increase dramatically when

the number of base clusterings increases from 1 to 5. After that, no distinct increase is observed. On Pima, the accuracy even decreases when the number of base clusterings is larger than 10, which is possibly due to the poor performance of the base clusterings. The trends of the curves might be related to the diversity of the base clusterings. In our experiments, we only use k-means for all base clusterings, so the cluster information may become redundant after a certain number of base clusterings have been used, and the accuracy does not increase anymore. The accuracy may keep increasing with more columns if the base clusterings are generated by different algorithms.

Figure 4.4: Average MP with increasing number of distributed partitions (BCE and k-means on Iris, Ionosphere, Wdbc, Glass, Bupa, Pima, Balance, Wine, and Segmentation).

4.4.4 Row-distributed Cluster Ensembles

For the experiments on row-distributed cluster ensembles, we divide our 20 base clustering results by rows (approximately) evenly into $P$ partitions, with $P$ increasing from 1 to 10 in steps of 1. We compare the performance of row-distributed BCE with distributed k-means [63]. Note that in our experiments, we use the heuristic row-distributed EM of Chapter 4.3.1. Although no theoretical guarantee of convergence is provided, in our

observation, the algorithm stops when the model parameters no longer change within 10 iterations. The comparative results on 9 datasets are presented in Figure 4.4. It is clear that row-distributed BCE always has a higher accuracy than distributed k-means except on Balance. For most datasets, the performance of row-distributed BCE is more stable across a varying number of partitions, indicating its robustness.

Figure 4.5: The comparison of running time (in seconds) between column-distributed and general cluster ensembles on Wine Quality, with increasing number of base clusterings.

4.4.5 Column-distributed Cluster Ensembles

We run experiments for column-distributed cluster ensembles with an increasing number of base clusterings (20, 60, 120, 240, 480, 960, 1440, 1920), which are picked randomly from a pool of 3000 base clustering results. We run the client-server style algorithm of Chapter 4.3.2 with one client maintaining one base clustering, such that multiple clients can run in parallel. The accuracy in the column-distributed case is the same as that of the general cluster ensemble using BCE, since they use exactly the same algorithm except that the column-distributed variant runs it in a distributed manner. Ignoring the communication overhead between the clients and the server, the comparison of running time between the column-distributed and general cluster ensembles is presented in Figure 4.5. We can see that the column-distributed cluster ensemble is much more efficient than the general case; especially when the number of base clusterings is large, the column-distributed variant is several orders of magnitude faster. Therefore, column-distributed BCE is readily applicable to real life settings with large data sets.

4.5 Conclusion

In this chapter, we have discussed Bayesian cluster ensembles (BCE), a mixed-membership generative model for obtaining a consensus clustering by combining multiple base clustering results. BCE provides a Bayesian way to combine clusterings, and entirely avoids the cluster label correspondence problems encountered in graph based approaches to the cluster ensemble problem. A variational approximation based algorithm is proposed for learning a Bayesian cluster ensemble. Compared with existing algorithms, BCE is the most versatile because of its applicability to several variants of the cluster ensemble problem, including missing value cluster ensembles, row-distributed, and column-distributed cluster ensembles. In addition, extensive experimental results show that BCE outperforms other algorithms in terms of accuracy and stability, and it can be run in a distributed manner without exchanging base clustering results, thereby preserving privacy and/or yielding substantial speed-ups.

Chapter 5

Bayesian Co-clustering

In Chapters 2-4, we have introduced clustering algorithms for one-way data. While one-way data is arguably the most common data presentation, in reality, a considerable amount of data is two-way, capturing the relation between two entities of interest. For example, users rate movies in recommendation systems, customers purchase products in market-basket analysis, genes have expression levels under experiments in computational biology, etc. Such two-way data are represented as a data matrix whose rows and columns each represent one entity. An important data mining task pertinent to two-way data is to get a clustering of each entity, e.g., movie and user groups in recommendation systems, or product and customer groups in market-basket analysis. Traditional clustering algorithms do not perform well on such problems because they consider each row/column of the matrix as a feature vector and are unable to capture the similarity among the features. For instance, even if two users have similar preferences for two movies of the same type, traditional clustering algorithms may still put the users in two different clusters since they do not have similar preferences on the exact same movies. In comparison, co-clustering algorithms [59], i.e., simultaneous clustering of the rows and columns of a data matrix, can achieve much better performance in terms of discovering the structure of the data [30] and predicting the missing values [4] by capturing the similarity among the rows and columns of the matrix.

Co-clustering has recently received significant attention in algorithm development and applications. [38], [30], and [54] applied co-clustering to text mining, bioinformatics,

and recommendation systems, respectively. [10] proposed a generalized Bregman co-clustering algorithm by considering co-clustering as a matrix approximation problem. While these techniques work reasonably well on real data, one important restriction is that almost all of these algorithms are partitional [69], i.e., a row/column belongs to only one row/column cluster. Such an assumption is often restrictive, since objects in real world data typically belong to multiple clusters, possibly with varying degrees. For example, a user might be an action movie fan and also a cartoon movie fan. Similar situations arise in most other domains. Therefore, a mixed membership of rows and columns might be more appropriate, and at times essential, for describing the structure of such data. It is also expected to substantially benefit the application of co-clustering in such domains.

In this chapter, we introduce Bayesian co-clustering (BCC) by viewing co-clustering as a generative mixture modeling problem. We assume each row and each column to have a mixed membership, from which we generate row and column clusters. Each entry of the data matrix is then generated given that row-column cluster pair, i.e., the co-cluster. We introduce separate Dirichlet distributions as Bayesian priors over the mixed memberships, effectively averaging the mixture model over all possible mixed memberships. Further, BCC can use any exponential family distribution [12] as the generative model for the co-clusters, which allows BCC to be applied to a wide variety of data types, such as real, binary, or discrete matrices. For inference and parameter estimation, we propose an efficient variational EM-style algorithm that preserves the dependencies among entries in the same row/column. The model is designed to naturally handle sparse matrices, as the inference is done based only on the non-missing entries. Moreover, as a useful by-product, the model accomplishes co-embedding, i.e., simultaneous dimensionality reduction of the individual rows and columns of the matrix, leading to a simple way to visualize the row/column objects. The efficacy of BCC is demonstrated by experiments on simulated and real data.

The existing literature has a few examples of generative models for co-clustering. Nowicki et al. [98] proposed a stochastic blockstructures model that builds a mixture model for stochastic relationships among objects and identifies the latent clusters via posterior inference. Kemp et al. [67] proposed an infinite relational model that discovers stochastic structure in relational data in the form of binary observations. Airoldi et al. [5] recently proposed a mixed membership stochastic blockmodel that relaxes the

single-latent-role restriction in the stochastic blockstructures model. Such existing models have one or more of the following limitations: (1) the model only handles binary relationships; (2) the model deals with relations within one type of entity, such as a social network among people; (3) there is no computationally efficient algorithm to do inference, and one has to rely on stochastic approximation based on sampling. The proposed BCC model has none of these limitations, and actually goes much further by leveraging the good ideas in such models.

5.1 Bayesian Co-clustering

Given an $N \times M$ data matrix $X$, for the purpose of co-clustering, we assume there are $k_1$ row clusters $\{z_1 = g, [g]_1^{k_1}\}$ (where $[g]_1^{k_1}$ denotes $g = 1, \ldots, k_1$) and $k_2$ column clusters $\{z_2 = h, [h]_1^{k_2}\}$. Bayesian co-clustering assumes two Dirichlet distributions, $\text{Dirichlet}(\alpha_1)$ and $\text{Dirichlet}(\alpha_2)$, for rows and columns respectively, from which the mixing weights $\pi_{1i}$ and $\pi_{2j}$ for each row $i$ and each column $j$ are generated. Row clusters for entries in row $i$ and column clusters for entries in column $j$ are sampled from the discrete distributions $\text{discrete}(\pi_{1i})$ and $\text{discrete}(\pi_{2j})$ respectively. A row cluster $g$ and a column cluster $h$ together decide a co-cluster $(g, h)$, which has an exponential family distribution $p_\psi(x \mid \theta_{gh})$, where $\theta_{gh}$ is the parameter of the generative model for co-cluster $(g, h)$. For simplicity, we drop $\psi$ from $p_\psi(x \mid \theta_{gh})$, and the generative process for the whole data matrix is as follows (Figure 5.1):

1. For each row $i$, $[i]_1^N$, choose $\pi_{1i} \sim \text{Dirichlet}(\alpha_1)$.
2. For each column $j$, $[j]_1^M$, choose $\pi_{2j} \sim \text{Dirichlet}(\alpha_2)$.
3. To generate a non-missing entry in row $i$ and column $j$,
   (a) choose $z_1 \sim \text{discrete}(\pi_{1i})$, $z_2 \sim \text{discrete}(\pi_{2j})$,
   (b) choose $x_{ij} \sim p(x \mid \theta_{z_1 z_2})$.

For this proposed model, the marginal probability of an entry $x$ in the data matrix $X$ is given by:
$$p(x \mid \alpha_1, \alpha_2, \Theta) = \int_{\pi_1} \int_{\pi_2} p(\pi_1 \mid \alpha_1)\, p(\pi_2 \mid \alpha_2) \sum_{z_1} \sum_{z_2} p(z_1 \mid \pi_1)\, p(z_2 \mid \pi_2)\, p(x \mid \theta_{z_1 z_2})\, d\pi_1\, d\pi_2.$$
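As an illustration of this generative process, the following sketch (an assumed NumPy implementation, not the dissertation's code) samples an $N \times M$ matrix from BCC with univariate Gaussian co-cluster distributions; note that $\pi_{1i}$ and $\pi_{2j}$ are drawn once per row and column and reused for every entry in that row or column.

```python
import numpy as np

def sample_bcc(N, M, alpha1, alpha2, mu, sigma, seed=0):
    """Sample an (N, M) matrix from BCC with Gaussian co-clusters.
    mu, sigma: (k1, k2) arrays of co-cluster means / standard deviations."""
    rng = np.random.default_rng(seed)
    k1, k2 = mu.shape
    pi1 = rng.dirichlet(alpha1, size=N)     # row mixed memberships, one draw per row
    pi2 = rng.dirichlet(alpha2, size=M)     # column mixed memberships, one draw per column
    X = np.empty((N, M))
    for i in range(N):
        for j in range(M):
            z1 = rng.choice(k1, p=pi1[i])   # row cluster for entry (i, j)
            z2 = rng.choice(k2, p=pi2[j])   # column cluster for entry (i, j)
            X[i, j] = rng.normal(mu[z1, z2], sigma[z1, z2])
    return X
```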

Figure 5.1: The Bayesian co-clustering model.

The probability of the entire matrix is, however, not the product of all such marginal probabilities. That is because $\pi_1$ for any row and $\pi_2$ for any column are sampled only once for all entries in that row/column. Therefore, the model introduces a coupling between observations in the same row/column, so they are not statistically independent. Note that this is a crucial departure from most mixture models, which assume the joint probability of all data points to be simply the product of the marginal probabilities of each point. The overall joint distribution over all observable and latent variables is given by
$$p(X, \pi_{1i}, \pi_{2j}, z_{1ij}, z_{2ij}, [i]_1^N, [j]_1^M \mid \alpha_1, \alpha_2, \Theta) = \left(\prod_i p(\pi_{1i} \mid \alpha_1)\right) \left(\prod_j p(\pi_{2j} \mid \alpha_2)\right) \prod_{i,j} \big(p(z_{1ij} \mid \pi_{1i})\, p(z_{2ij} \mid \pi_{2j})\, p(x_{ij} \mid \theta_{z_{1ij}, z_{2ij}})\big)^{\delta_{ij}}, \qquad (5.1)$$
where $\delta_{ij}$ is an indicator function which takes value 0 when $x_{ij}$ is missing and 1 otherwise, so that only the non-missing entries are considered, $z_{1ij} \in \{1, \ldots, k_1\}$ is the latent row cluster and $z_{2ij} \in \{1, \ldots, k_2\}$ is the latent column cluster for observation $x_{ij}$. Since the observations are conditionally independent given $\{\pi_{1i}, [i]_1^N\}$ for all rows and $\{\pi_{2j}, [j]_1^M\}$

for all columns, the joint distribution is
$$p(X, \pi_{1i}, \pi_{2j}, [i]_1^N, [j]_1^M \mid \alpha_1, \alpha_2, \Theta) = \left(\prod_i p(\pi_{1i} \mid \alpha_1)\right) \left(\prod_j p(\pi_{2j} \mid \alpha_2)\right) \prod_{i,j} p(x_{ij} \mid \pi_{1i}, \pi_{2j}, \Theta)^{\delta_{ij}},$$
where the marginal probability is
$$p(x_{ij} \mid \pi_{1i}, \pi_{2j}, \Theta) = \sum_{z_{1ij}} \sum_{z_{2ij}} p(z_{1ij} \mid \pi_{1i})\, p(z_{2ij} \mid \pi_{2j})\, p(x_{ij} \mid \theta_{z_{1ij}, z_{2ij}}).$$
Marginalizing over $\{\pi_{1i}, [i]_1^N\}$ and $\{\pi_{2j}, [j]_1^M\}$, the probability of observing the entire matrix $X$ is:
$$p(X \mid \alpha_1, \alpha_2, \Theta) = \int \cdots \int \left(\prod_i p(\pi_{1i} \mid \alpha_1)\right) \left(\prod_j p(\pi_{2j} \mid \alpha_2)\right) \prod_{i,j} \left(\sum_{z_{1ij}} \sum_{z_{2ij}} p(z_{1ij} \mid \pi_{1i})\, p(z_{2ij} \mid \pi_{2j})\, p(x_{ij} \mid \theta_{z_{1ij}, z_{2ij}})\right)^{\delta_{ij}} d\pi_{11} \cdots d\pi_{1N}\, d\pi_{21} \cdots d\pi_{2M}. \qquad (5.2)$$

Figure 5.2: The variational distribution $q$ for (a) rows and (b) columns. $\gamma_1, \gamma_2$ are Dirichlet parameters and $\phi_1, \phi_2$ are discrete parameters.

It is easy to see (Figure 5.1) that one-way Bayesian clustering models such as MMNB and LDA are special cases of BCC. Further, BCC inherits all the advantages of MMNB and LDA: the ability to handle sparsity, applicability to diverse data types using any exponential family distribution, and flexible Bayesian priors using Dirichlet distributions.
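To make the coupling in (5.2) concrete, the rough Monte Carlo sketch below estimates the marginal log-likelihood by sampling the row and column mixed memberships once each and mixing over co-clusters for every non-missing entry. Gaussian co-clusters and all names are illustrative assumptions; this naive estimator is high-variance and is not how the dissertation evaluates the likelihood, which instead relies on the variational bound developed next.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mc_log_marginal(X, mask, alpha1, alpha2, mu, sigma, S=200, seed=0):
    """Naive Monte Carlo estimate of log p(X | alpha1, alpha2, Theta), eq. (5.2),
    for Gaussian co-clusters; mask is the delta_ij indicator of non-missing entries."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    sample_ll = np.empty(S)
    for s in range(S):
        pi1 = rng.dirichlet(alpha1, size=N)        # one draw per row, shared by all its entries
        pi2 = rng.dirichlet(alpha2, size=M)        # one draw per column
        ll = 0.0
        for i, j in zip(*np.nonzero(mask)):
            # p(x_ij | pi_1i, pi_2j) = sum_{g,h} pi_1i[g] pi_2j[h] N(x_ij; mu_gh, sigma_gh)
            p = pi1[i] @ norm.pdf(X[i, j], mu, sigma) @ pi2[j]
            ll += np.log(p + 1e-300)
        sample_ll[s] = ll
    return logsumexp(sample_ll) - np.log(S)        # log of the Monte Carlo average
```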

Table 5.1: Expressions for the terms in $\mathcal{L}(\gamma_1, \gamma_2, \phi_1, \phi_2; \alpha_1, \alpha_2, \Theta)$.

$E_q[\log p(\pi_1 \mid \alpha_1)] = \sum_{i=1}^{N} \sum_{g=1}^{k_1} (\alpha_{1g} - 1)\big(\Psi(\gamma_{1ig}) - \Psi(\sum_{l=1}^{k_1} \gamma_{1il})\big) + N \log \Gamma\big(\sum_{g=1}^{k_1} \alpha_{1g}\big) - N \sum_{g=1}^{k_1} \log \Gamma(\alpha_{1g})$

$E_q[\log p(z_1 \mid \pi_1)] = \sum_{i=1}^{N} \sum_{g=1}^{k_1} w_{1i} \phi_{1ig} \big(\Psi(\gamma_{1ig}) - \Psi(\sum_{l=1}^{k_1} \gamma_{1il})\big)$

$E_q[\log q(\pi_1 \mid \gamma_1)] = \sum_{i=1}^{N} \sum_{g=1}^{k_1} (\gamma_{1ig} - 1)\big(\Psi(\gamma_{1ig}) - \Psi(\sum_{l=1}^{k_1} \gamma_{1il})\big) + \sum_{i=1}^{N} \log \Gamma\big(\sum_{g=1}^{k_1} \gamma_{1ig}\big) - \sum_{i=1}^{N} \sum_{g=1}^{k_1} \log \Gamma(\gamma_{1ig})$

$E_q[\log q(z_1 \mid \phi_1)] = \sum_{j=1}^{M} \sum_{i=1}^{N} \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \delta_{ij} \phi_{1ig} \phi_{2jh} \log \phi_{1ig}$

$E_q[\log p(x \mid z_1, z_2, \Theta)] = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \delta_{ij} \phi_{1ig} \phi_{2jh} \log p(x_{ij} \mid \theta_{gh})$

5.2 Inference and Learning

Given the data matrix $X$, the learning task for BCC is to estimate the model parameters $(\alpha_1, \alpha_2, \Theta)$ such that the likelihood of observing the matrix $X$ is maximized. Since the computation of $\log p(X \mid \alpha_1, \alpha_2, \Theta)$ is intractable, we use variational inference to get a tractable lower bound on $\log p(X \mid \alpha_1, \alpha_2, \Theta)$. In particular, we introduce a variational distribution $q(z_1, z_2, \pi_1, \pi_2 \mid \gamma_1, \gamma_2, \phi_1, \phi_2)$ ($q$ for brevity) as an approximation of the latent variable distribution $p(z_1, z_2, \pi_1, \pi_2 \mid \alpha_1, \alpha_2, \Theta)$:
$$q(z_1, z_2, \pi_1, \pi_2 \mid \gamma_1, \gamma_2, \phi_1, \phi_2) = \left(\prod_{i=1}^{N} q(\pi_{1i} \mid \gamma_{1i})\right) \left(\prod_{j=1}^{M} q(\pi_{2j} \mid \gamma_{2j})\right) \prod_{i=1}^{N} \prod_{j=1}^{M} q(z_{1ij} \mid \phi_{1i})\, q(z_{2ij} \mid \phi_{2j}), \qquad (5.3)$$
where $\gamma_1 = \{\gamma_{1i}, [i]_1^N\}$ and $\gamma_2 = \{\gamma_{2j}, [j]_1^M\}$ are variational parameters for Dirichlet distributions with $k_1$ and $k_2$ dimensions respectively for rows and columns, and $\phi_1 = \{\phi_{1i}, [i]_1^N\}$ and $\phi_2 = \{\phi_{2j}, [j]_1^M\}$ are variational parameters for discrete distributions with $k_1$ and $k_2$ dimensions for rows and columns. Figure 5.2 shows the approximating distribution $q$ as a graphical model, where $w_{1i}$ and $w_{2j}$ are the numbers of non-missing entries in row $i$ and column $j$. Compared to the variational approximation used in MMNB [114] and LDA [25], where the cluster assignment $z$ for every single feature has its own variational discrete distribution, in our approximation there is only one variational discrete distribution for an entire row/column, which is similar to the fast variational inference in Chapter 2. Such a strategy helps maintain the dependencies among all the entries in a row or column, and the inference is fast due to the smaller number of variational parameters over which optimization needs to be done. By a direct application of Jensen's inequality [94], we obtain a lower bound on $\log p(X \mid \alpha_1, \alpha_2, \Theta)$:

$$\log p(X \mid \alpha_1, \alpha_2, \Theta) \geq E_q[\log p(X, z_1, z_2, \pi_1, \pi_2 \mid \alpha_1, \alpha_2, \Theta)] - E_q[\log q(z_1, z_2, \pi_1, \pi_2 \mid \gamma_1, \gamma_2, \phi_1, \phi_2)]. \qquad (5.4)$$
We use $\mathcal{L}(\gamma_1, \gamma_2, \phi_1, \phi_2; \alpha_1, \alpha_2, \Theta)$, or $\mathcal{L}$ for brevity, to denote the lower bound. $\mathcal{L}$ can be expanded as:
$$\mathcal{L}(\gamma_1, \gamma_2, \phi_1, \phi_2; \alpha_1, \alpha_2, \Theta) = E_q[\log p(\pi_1 \mid \alpha_1)] + E_q[\log p(\pi_2 \mid \alpha_2)] + E_q[\log p(z_1 \mid \pi_1)] + E_q[\log p(z_2 \mid \pi_2)] - E_q[\log q(\pi_1 \mid \gamma_1)] - E_q[\log q(\pi_2 \mid \gamma_2)] - E_q[\log q(z_1 \mid \phi_1)] - E_q[\log q(z_2 \mid \phi_2)] + E_q[\log p(x \mid z_1, z_2, \Theta)].$$
The expression for each type of term in $\mathcal{L}$ is listed in Table 5.1; the terms $E_q[\log p(\pi_2 \mid \alpha_2)]$, $E_q[\log p(z_2 \mid \pi_2)]$, $E_q[\log q(\pi_2 \mid \gamma_2)]$, and $E_q[\log q(z_2 \mid \phi_2)]$ have a similar form. Our algorithm maximizes the parameterized lower bound with respect to the variational parameters $(\gamma_1, \gamma_2, \phi_1, \phi_2)$ and the model parameters $(\alpha_1, \alpha_2, \Theta)$ alternately.

5.2.1 Inference

In the inference step, maximizing the lower bound $\mathcal{L}(\gamma_1, \gamma_2, \phi_1, \phi_2; \alpha_1, \alpha_2, \Theta)$ w.r.t. the variational parameters $(\gamma_1, \gamma_2, \phi_1, \phi_2)$ yields the following update equations:
$$\phi_{1ig} \propto \exp\left( \Psi(\gamma_{1ig}) + \frac{\sum_{j,h} \delta_{ij} \phi_{2jh} \log p(x_{ij} \mid \theta_{gh})}{w_{1i}} \right) \qquad (5.5)$$
$$\phi_{2jh} \propto \exp\left( \Psi(\gamma_{2jh}) + \frac{\sum_{i,g} \delta_{ij} \phi_{1ig} \log p(x_{ij} \mid \theta_{gh})}{w_{2j}} \right) \qquad (5.6)$$
$$\gamma_{1ig} = \alpha_{1g} + w_{1i} \phi_{1ig} \qquad (5.7)$$
$$\gamma_{2jh} = \alpha_{2h} + w_{2j} \phi_{2jh}, \qquad [g]_1^{k_1}, [h]_1^{k_2}, [i]_1^N, [j]_1^M, \qquad (5.8)$$
where $\phi_{1ig}$ is the $g$th component of $\phi_{1i}$, $\phi_{2jh}$ is the $h$th component of $\phi_{2j}$, and similarly for $\gamma_{1ig}$ and $\gamma_{2jh}$, and $\Psi(\cdot)$ is the digamma function. From a clustering perspective, $\phi_{1ig}$ denotes the degree to which row $i$ belongs to cluster $g$, for $[i]_1^N$ and $[g]_1^{k_1}$; and similarly for $\phi_{2jh}$.

We use simulated annealing [68] in the inference step to avoid bad local minima. In particular, instead of using (5.5) and (5.6) directly for updating $\phi_{1ig}$ and $\phi_{2jh}$, we use
$$\phi^{(t)}_{1ig} \propto (\phi_{1ig})^{1/t}, \qquad \phi^{(t)}_{2jh} \propto (\phi_{2jh})^{1/t}$$

at each temperature $t$. At the beginning, $t = \infty$, so the probabilities of row $i$/column $j$ belonging to all row/column clusters are almost equal. As $t$ slowly decreases, the peaks of $\phi^{(t)}_{1ig}$ and $\phi^{(t)}_{2jh}$ gradually show up until we reach $t = 1$, where $\phi^{(1)}_{1ig}$ and $\phi^{(1)}_{2jh}$ become $\phi_{1ig}$ and $\phi_{2jh}$, as in (5.5) and (5.6). We then stop decreasing the temperature and keep updating $\phi_1$ and $\phi_2$ until convergence. After that, we go on to update $\gamma_1$ and $\gamma_2$ following (5.7) and (5.8).

5.2.2 Parameter Estimation

We use $\mathcal{L}(\gamma_1^*, \gamma_2^*, \phi_1^*, \phi_2^*; \alpha_1, \alpha_2, \Theta)$ as the surrogate objective function to be maximized for updating $\{\alpha_1, \alpha_2, \Theta\}$, where $(\gamma_1^*, \gamma_2^*, \phi_1^*, \phi_2^*)$ are the optimal values obtained in the inference step. To estimate the Dirichlet parameters $(\alpha_1, \alpha_2)$, one can use an efficient Newton update as in (2.21). In particular, the update equations for $\alpha_1$ and $\alpha_2$ are:
$$\alpha_1 = \alpha_1 + \eta H(\alpha_1)^{-1} g(\alpha_1) \qquad (5.9)$$
$$\alpha_2 = \alpha_2 + \eta H(\alpha_2)^{-1} g(\alpha_2), \qquad (5.10)$$
where $H(\cdot)$ and $g(\cdot)$ are the Hessian matrix and gradient of the lower bound $\mathcal{L}$ at $\alpha_1$ and $\alpha_2$, and $\eta$ is a line search parameter. For estimating $\Theta$, in principle, a closed form solution is possible for all exponential family distributions [12, 15]. Similar to the parameter estimation in MMNB, following [12], the estimated mean parameter $\{\tau_{gh}, [g]_1^{k_1}, [h]_1^{k_2}\}$ of the exponential family is given by:
$$\tau_{gh} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} x_{ij} \phi_{1ig} \phi_{2jh}}{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} \phi_{1ig} \phi_{2jh}}, \qquad (5.11)$$
and $\theta_{gh}$ is obtained from $\tau_{gh}$ by conjugacy [12]. In particular, consider the special case where the component distributions are univariate Gaussians. The update equations for $\{\mu_{gh}, \sigma^2_{gh}, [g]_1^{k_1}, [h]_1^{k_2}\}$ are given by:
$$\mu_{gh} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} x_{ij} \phi_{1ig} \phi_{2jh}}{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} \phi_{1ig} \phi_{2jh}} \qquad (5.12)$$
$$\sigma^2_{gh} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} (x_{ij} - \mu_{gh})^2 \phi_{1ig} \phi_{2jh}}{\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} \phi_{1ig} \phi_{2jh}}. \qquad (5.13)$$
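As an illustration (assumed NumPy layouts, not the dissertation's code), the sketch below implements one annealed pass over the E-step updates (5.5)-(5.8) and the Gaussian M-step (5.12)-(5.13); logp[i, j, g, h] holds $\log p(x_{ij} \mid \theta_{gh})$ with missing entries set to 0, delta is the 0/1 indicator $\delta_{ij}$, and temperature $t = 1$ recovers the plain updates.

```python
import numpy as np
from scipy.special import digamma

def bcc_e_step(logp, delta, alpha1, alpha2, gamma1, gamma2, phi2, t=1.0):
    """One annealed pass of (5.5)-(5.8); phi2 holds the current column responsibilities."""
    w1 = delta.sum(axis=1)                                      # non-missing entries per row
    w2 = delta.sum(axis=0)                                      # non-missing entries per column
    # eq. (5.5): row responsibilities, annealed as phi^(t) proportional to phi^(1/t)
    s1 = digamma(gamma1) + np.einsum('ijgh,jh,ij->ig', logp, phi2, delta) / np.maximum(w1, 1)[:, None]
    phi1 = np.exp((s1 - s1.max(1, keepdims=True)) / t)
    phi1 /= phi1.sum(1, keepdims=True)
    # eq. (5.6): column responsibilities, using the updated phi1
    s2 = digamma(gamma2) + np.einsum('ijgh,ig,ij->jh', logp, phi1, delta) / np.maximum(w2, 1)[:, None]
    phi2 = np.exp((s2 - s2.max(1, keepdims=True)) / t)
    phi2 /= phi2.sum(1, keepdims=True)
    gamma1 = alpha1 + w1[:, None] * phi1                        # eq. (5.7)
    gamma2 = alpha2 + w2[:, None] * phi2                        # eq. (5.8)
    return phi1, phi2, gamma1, gamma2

def bcc_m_step_gaussian(X, delta, phi1, phi2, eps=1e-12):
    """Gaussian co-cluster means and variances, eqs. (5.12)-(5.13)."""
    W = np.einsum('ig,jh,ij->gh', phi1, phi2, delta) + eps      # total responsibility per co-cluster
    mu = np.einsum('ig,jh,ij->gh', phi1, phi2, delta * X) / W   # eq. (5.12)
    second = np.einsum('ig,jh,ij->gh', phi1, phi2, delta * X ** 2) / W
    return mu, second - mu ** 2                                 # variance, equivalent to eq. (5.13)
```

In practice, (5.5) and (5.6) are iterated at a slowly decreasing temperature before the $\gamma$ updates, as described above.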

Given the update equations, a variational EM algorithm updates the variational parameters following (5.5)-(5.8) in the E-step and the model parameters following (5.9)-(5.11) in the M-step, iteratively until convergence.

5.3 Experimental Results

In this section, we present extensive experimental results for Bayesian co-clustering on simulated data sets and on real data sets.

5.3.1 Experiments on Simulated Data

Three data matrices are generated with 4 row clusters and 5 column clusters, i.e., 20 co-clusters in total, such that each co-cluster generates a submatrix. We use Gaussian, Bernoulli, and Poisson as the generative model for each data matrix respectively, and each submatrix is generated from the generative model with a predefined parameter, which is set to be different for different submatrices. After generating the data matrix, we randomly permute its rows and columns to yield the final dataset. For each data matrix, we do semi-supervised initialization by using 5% of the data in each co-cluster. The results include two parts: parameter estimation and cluster assignment. We compare the estimated parameters with the true model parameters used to generate the data matrix. Further, we evaluate the cluster assignment in terms of micro-precision as defined in (4.8).

Figure 5.3: Parameter estimation for Gaussian: (a) true and (b) estimated parameters.

For each generative model, we run the algorithm three times and pick the estimated

parameters with the highest log-likelihood. The log-likelihood measures the fit of the model to the data, so we are using the model that fits the data best among the three runs. Note that no class label is used while choosing the model. The comparison of the true and estimated parameters after alignment for the Gaussian case is in Figure 5.3. The color of each sub-block represents the parameter value for that co-cluster (darker is higher). The cluster accuracy is shown in Table 5.2, which is the average over three runs. From these results, we observe two things: (1) our algorithm is applicable to different data types by choosing an appropriate generative model; (2) we are able to get an accurate parameter estimation and a high cluster accuracy, with semi-supervised initialization using only 5% of the data.

Table 5.2: Micro-precision on simulated data (row and column micro-precision for the Gaussian, Bernoulli, and Poisson generative models).

5.3.2 Experiments on Real Data

Three real datasets are used in our experiments: Movielens, Foodmart, and Jester. The description of these data sets can be found in Chapter 2. We also construct three corresponding binarized data sets. In particular, we binarize Movielens such that entries whose ratings are higher than 3 become 1 and others become 0, we binarize Jester such that the non-negative entries become 1 and the negative entries become 0, and we binarize Foodmart such that entries whose number of products is below the median become 0 and others become 1. We use both the original and the binarized datasets in our experiments. For the binarized data, we use the Bernoulli distribution as the generative model. For the original data, we use discrete, Poisson, and Gaussian generative models for Movielens, Foodmart, and Jester respectively. For the Foodmart data, there is a one-unit right shift of the Poisson distribution since the values of the non-missing entries start from 1 instead of 0, so we subtract 1 from all non-missing entries to shift them back. We compare BCC with MMNB and LDA. The comparison with MMNB is done on both the binarized and original datasets. The comparison with LDA is done only on

binarized datasets since LDA is not designed to handle real values. To apply LDA, we consider the features with feature value 1 as the tokens appearing in each data point, like the words in a document. For simplicity, we use "row cluster" or "cluster" to refer to the user/customer clusters, and use "column cluster" to refer to the movie, product, and joke clusters for BCC on Movielens, Foodmart, and Jester respectively. To ensure a fair comparison, we do not use simulated annealing for BCC in these experiments because there is no simulated annealing in MMNB or LDA either.

Perplexity Comparison

Starting from a random initialization, we train the model to obtain model parameters $(\alpha_1^*, \alpha_2^*, \Theta^*)$ that (locally) maximize the variational lower bound on the log-likelihood. We then use the model parameters to do inference, that is, to infer the mixed membership for rows/columns. In particular, there are two steps in our evaluation: (1) combine the training and test data together and do inference (E-step) to obtain the variational parameters; (2) use the model parameters and variational parameters to obtain the perplexity on the test set. In addition, we also report the perplexity on the training set. Perplexity is defined in (2.33). Recall that a lower perplexity indicates better modeling performance. For example, on Movielens, a low perplexity on the test set means that the model captures the preference pattern of the users, such that the model's predicted preferences on test movies for a user would be quite close to the actual preferences; on the contrary, a high perplexity indicates that the user's preferences on test movies would be quite different from the model's predictions. A similar argument works for Foodmart and Jester as well.

We compare the perplexity among BCC, MMNB, and LDA with the number of row clusters varying from 5 to 25 in steps of 5, and the number of column clusters for BCC fixed to 20, 10, and 5 for Movielens, Foodmart, and Jester respectively. The results are reported as the average perplexity over 10-fold cross validation in Figures 5.4 and 5.5 and Table 5.3. Figure 5.4 compares the perplexity of BCC, MMNB, and LDA on binarized Jester, and Figure 5.5 compares the perplexity of BCC and MMNB on the original Movielens dataset, both with a varying number of clusters. Note that due to the distinct differences in perplexity among the three models, the y-axes are not continuous and the unit scales are not all the same. Table 5.3 presents the perplexities on both the binarized and original datasets

Figure 5.4: Perplexity comparison of BCC, MMNB and LDA with varying number of clusters on binarized Jester: (a) training set, (b) test set.

Figure 5.5: Perplexity comparison of BCC and MMNB with varying number of clusters on original Movielens: (a) training set, (b) test set.

From these results, there are two observations:

1. For BCC and LDA, the perplexities of BCC on both training and test sets are 2-3 orders of magnitude lower than those of LDA, and a paired t-test shows that the difference is statistically significant with an extremely small p-value.

2. For BCC and MMNB, although MMNB sometimes has a lower perplexity than BCC on the training sets, on the test sets the perplexities of BCC are lower than those of MMNB in all cases. Again, the difference is significant based on the paired t-test. MMNB's high perplexities on the test sets indicate over-fitting, especially on the original Movielens data. In comparison, BCC behaves much better than MMNB on the test sets, possibly for two reasons: (1) BCC uses far fewer variational parameters than MMNB, which helps avoid overfitting; (2) BCC is able to capture the co-cluster structure, which is missing in MMNB.

Table 5.3: Perplexity of BCC, MMNB, and LDA on binarized and original datasets with 10 clusters. The p-value is obtained from a paired t-test on the differences of test-set perplexities between BCC and LDA, and between BCC and MMNB.

(a) On binarized datasets

            Train-set perplexity    Test-set perplexity    Test-set p-value
            LDA  MMNB  BCC          LDA  MMNB  BCC          BCC-LDA  BCC-MMNB
Movielens                                                   <0.001   <0.001
Foodmart                                                    <0.001   <0.001
Jester                                                      <0.001   <0.001

(b) On original datasets

            Train-set perplexity    Test-set perplexity    Test-set p-value
            MMNB  BCC               MMNB  BCC               BCC-MMNB
Movielens                                                   <0.001
Foodmart                                                    <0.001
Jester                                                      <0.001

Prediction Comparison

Let X_train and X_test be the original training and test sets respectively. We evaluate the model's prediction performance as follows: we compute the variational parameters (γ_1, γ_2, φ_1, φ_2) based on (X_train, X_test) and use them to compute perplexity(X_test). We then repeat the process after modifying a certain percentage of the test set to create X̃_test (noisy data), compute the variational parameters (γ̃_1, γ̃_2, φ̃_1, φ̃_2) corresponding to (X_train, X̃_test), and compute perplexity(X̃_test) using these variational parameters. If the model yields a lower perplexity on the true test set than on the modified one, i.e., perplexity(X_test) < perplexity(X̃_test), the model explains X_test better than X̃_test. If used for prediction based on log-likelihood, the model will accurately predict X_test. For a good model, we would expect the perplexity to increase as a larger percentage of the test data is modified. Ideally, such an increase will be monotonic, implying that the true test data X_test is the most likely according to the model, and a higher perplexity can be used as a sign of noisier data. In our experiments, since X_train is fixed, instead of comparing perplexity(X_test) with perplexity(X̃_test) directly, we compare perplexity(X_train, X_test) with perplexity(X_train, X̃_test). We only compare the prediction on the binarized data, which is a reasonable simplification because in

real recommendation systems, we usually only need to know whether the user likes the movie/product/joke or not in order to decide whether to recommend it. To add noise to the binarized data, we flip entries from 1 to 0 and from 0 to 1. We record the perplexities with the percentage of noise increasing from 1% to 10% in steps of 1%, and report the average perplexity over 10-fold cross validation at each step. The perplexity curves are shown in Figure 5.6. At the starting point, with no noise, we have the perplexity on the true test set X_test. At the other extreme, 10% of the entries in the test set have been modified.

Figure 5.6: Perplexity curves for Movielens, Foodmart and Jester with increasing percentage of noise.

As shown in Figure 5.6, all three lines go up steadily with an increasing percentage of test data modified. This is a surprisingly good result, implying that our model is able to detect increasing noise and convey that information through increasing perplexities. The most accurate result, i.e., the one with the lowest perplexity, is exactly the true test set at the starting point. Therefore, BCC can be used to accurately predict missing values in a matrix.
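The noise-injection protocol above amounts to flipping a fixed fraction of the held-out binary entries and re-scoring. A minimal sketch, assuming the test entries are stored as a flat array of 0/1 values (the helper name and data are hypothetical):

```python
import numpy as np

def flip_fraction(values, fraction, seed=0):
    """Flip `fraction` of the binary test entries (1 -> 0, 0 -> 1)."""
    rng = np.random.default_rng(seed)
    noisy = values.copy()
    n_flip = int(round(fraction * noisy.size))
    idx = rng.choice(noisy.size, size=n_flip, replace=False)
    noisy[idx] = 1.0 - noisy[idx]
    return noisy

# Corrupt 1%, 2%, ..., 10% of a binary test vector; at each level one would
# re-run inference and record the perplexity, as in the experiment above.
test_vals = np.random.default_rng(1).integers(0, 2, size=1000).astype(float)
for p in np.arange(0.01, 0.11, 0.01):
    noisy = flip_fraction(test_vals, p, seed=0)
    # perplexity(model_loglik(noisy)) would be recorded here
```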

We add noise at finer steps of 0.1% and 0.01% of the test data each time, and compare the prediction performance of BCC with LDA. The results on binarized Jester and Movielens are presented in Figures 5.7 and 5.8. In both figures, the first row corresponds to adding noise in steps of 0.01% and the second row to steps of 0.1%. The trends of the perplexity curves reflect the prediction performance. On Jester, the perplexity curves for BCC in Figures 5.7(a) and 5.7(c) go up steadily at almost all times. However, the perplexity curves for LDA go up and down from time to time, especially in Figure 5.7(b), which means that sometimes LDA fits the data with more noise better than the data with less noise, indicating a lower prediction accuracy compared with BCC.

Figure 5.7: Perplexity curves of BCC and LDA with increasing percentage of noise on binarized Jester: (a) BCC and (b) LDA at steps of 0.01%; (c) BCC and (d) LDA at steps of 0.1%.

The difference is even more distinct on Movielens. When adding noise at steps of 0.01%, there is no clear trend in the perplexity curves in Figures 5.8(a) and 5.8(b), implying that neither BCC nor LDA is able to detect the noise at this resolution. However, when the step size increases to 0.1%, the perplexity curve of BCC starts to go up, as in Figure 5.8(c), while the perplexity curve of LDA goes down, as in Figure 5.8(d). The decreasing perplexity with the addition of noise indicates that LDA does not have good prediction performance on Movielens.

While these extensive results provide supporting evidence for BCC's better performance, we should be cautious about the conclusions we draw from the direct perplexity comparison between BCC and LDA. Given a binary dataset, BCC works on all non-missing entries, but LDA only works on the entries with value 1. Therefore, BCC and LDA actually work on different data, and hence their perplexities cannot be compared directly. However, the comparison gives us a rough idea of the two algorithms' behavior, such as the

distinct difference in perplexity ranges and the similar perplexity trends with an increasing number of clusters. Moreover, the prediction results show that BCC indeed does much better than LDA, no matter which part of the dataset each is using.

Figure 5.8: Perplexity curves of BCC and LDA with increasing percentage of noise on binarized Movielens: (a) BCC and (b) LDA at steps of 0.01%; (c) BCC and (d) LDA at steps of 0.1%.

Visualization

The co-clustering results give us a compressed representation of the original matrix. We can visualize it to study the relationship between row and column clusters. Figure 5.9 shows an example of user-movie co-clusters on Movielens. There are 10 × 20 sub-blocks, corresponding to 10 user clusters and 20 movie clusters. The shade of each sub-block is determined by the parameter value of the Bernoulli distribution for that co-cluster; a darker sub-block indicates a larger parameter. Since the parameter of a Bernoulli distribution is the probability of generating an outcome of 1 (a rating of 4 or 5), the darker a sub-block is, the more the corresponding movie cluster is preferred by the user cluster. Based on Figure 5.9, we can see that users in cluster 2 (U2) are big fans of all

kinds of movies, whereas users in U5 seem uninterested in all movies except those in movie cluster 13 (M13). Moreover, movies in M18 are very popular and preferred by most users, while movies in M4 seem to be far from best sellers. We can also tell that users in U1 like M18 the best and M8 the worst, and that U2 and U6 share several favorite types of movies.

Figure 5.9: Co-cluster parameters for Movielens (user clusters × movie clusters).

The variational parameters φ_1, with dimension k_1 for rows, and φ_2, with dimension k_2 for columns, give a low-dimensional representation of all the row and column objects. They can be considered the result of a simultaneous dimensionality reduction over the row and column feature vectors. We call the low-dimensional vectors φ_1 and φ_2 a co-embedding, since they are two inter-dependent low-dimensional representations of the row and column objects derived from the original data matrix. Co-embedding is a unique and novel by-product of our algorithm, which accomplishes dimensionality reduction while preserving dependencies between rows and columns. No partitional co-clustering algorithm is able to generate such an embedding, since partitional algorithms do not allow mixed membership over row and column clusters. To visualize the co-embedding, we apply ISOMAP [123] on φ_1 and φ_2 to further reduce the space to 2 dimensions (an alternative approach would be to set k_1 and k_2 to 2, so that φ_1 and φ_2 are themselves 2-dimensional). The results of the co-embedding for users and movies on binarized Movielens are shown in Figures 5.10(a) and 5.10(c). Each point in the figure denotes one user/movie. We mark three clusters with red, blue and green for users and movies respectively; other

points are colored pink.

Figure 5.10: Co-embedding and signatures for users (φ_1) and movies (φ_2) on the Movielens dataset: (a) user embedding, (b) user signatures, (c) movie embedding, (d) movie signatures.

Through this visualization, we can see how the users/movies are scattered in the space, where the clusters are located, how far one cluster is from another, and so on. Such information goes far beyond cluster assignments alone. In addition, we choose several points from the co-embedding and examine their properties. In Figures 5.10(a) and 5.10(c), we mark four users and four movies and extract their signatures. In general, a variety of methods can be used to generate signatures. In our experiment, we do the following: for each user, we count the number of movies she rates 1 in each of the movie clusters 1-20; after normalization, this 20-dimensional unit vector is used as the signature for the user. Similarly, for each movie, we count the number of users giving it a rating of 1 in each of the user clusters 1-10, and the normalized 10-dimensional unit vector is used as the signature for the movie. The signatures are shown in Figures 5.10(b) and 5.10(d) respectively. The numbers on the right are the user/movie IDs corresponding to the marked points in the co-embedding plots, showing where they are located. We can see that each signature is quite different from the others in terms of the value of each component.
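The signatures and the 2-D co-embedding described above can be computed directly from the binarized matrix and the row mixed-membership parameters. A sketch under that setup: `phi1` stands for the N × k_1 row memberships, `Isomap` from scikit-learn plays the role of the ISOMAP step, and the L1 normalization of the count vector is one reasonable reading of "unit vector"; all names and toy data are illustrative.

```python
import numpy as np
from sklearn.manifold import Isomap

def user_signature(B, mask, movie_cluster, k2):
    """For each user, count movies rated 1 in each movie cluster, then normalize."""
    sig = np.zeros((B.shape[0], k2))
    for h in range(k2):
        cols = (movie_cluster == h)
        sig[:, h] = ((B == 1) & mask)[:, cols].sum(axis=1)
    sig /= np.maximum(sig.sum(axis=1, keepdims=True), 1e-12)
    return sig

def co_embedding_2d(phi1, n_neighbors=10):
    """Reduce the k1-dimensional mixed memberships to 2-D for plotting."""
    return Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(phi1)

# Toy usage: 50 users, 30 movies, 5 user clusters, 4 movie clusters.
rng = np.random.default_rng(0)
B = rng.integers(0, 2, size=(50, 30)).astype(float)
mask = rng.random((50, 30)) < 0.6
movie_cluster = rng.integers(0, 4, size=30)
phi1 = rng.dirichlet(np.ones(5), size=50)
print(user_signature(B, mask, movie_cluster, 4).shape)   # (50, 4)
print(co_embedding_2d(phi1).shape)                       # (50, 2)
```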

5.4 Conclusion

In this chapter, we have discussed Bayesian co-clustering (BCC), which views co-clustering as a generative mixture modeling problem. BCC inherits the strengths and robustness of Bayesian modeling, is designed to work with sparse matrices, and can use any exponential family distribution as the generative model, making it suitable for a wide range of matrices. Unlike existing partitional co-clustering algorithms, BCC generates mixed memberships for rows and columns, which seem more appropriate for a variety of applications. A key advantage of the proposed variational approximation approach for BCC is that it is expected to be significantly faster than a stochastic approximation based on sampling, making it suitable for the large matrices found in real-life applications. Finally, the co-embedding obtained from BCC can be effectively used for visualization, subsequent predictive modeling, and decision making.

Chapter 6

Residual Bayesian Co-clustering

Bayesian co-clustering (BCC) [112] achieves good performance for co-clustering on two-way data, but one of its limitations is that it cannot incorporate row and column biases into the model. In reality, the data matrix usually contains row and column biases. For example, in a movie rating system, some users are generous in the sense that they tend to give high ratings, so a 3 in a 1-5 rating scheme indicates a poor movie; meanwhile, for users who are usually critical, a 3 might be a good rating. In terms of movies, those with famous movie stars tend to get higher ratings while others do not. In such cases, the ratings are biased for each user and each movie. In this chapter, we propose residual Bayesian co-clustering (RBC). It is an extension of BCC that infers the co-clustering from the residual matrix obtained by suitably subtracting the row and column biases from the original matrix; that is why we refer to the model as residual Bayesian co-clustering. The efficacy of RBC is demonstrated by experiments on real datasets. In particular, we show that RBC consistently outperforms several state-of-the-art co-clustering algorithms on missing value prediction.

6.1 Residual Bayesian Co-clustering

Given an N × M data matrix X, for the purpose of co-clustering, we assume k_1 row clusters {z_1 = g, g = 1, ..., k_1} and k_2 column clusters {z_2 = h, h = 1, ..., k_2}, and two Dirichlet distributions Dirichlet(α_1) and Dirichlet(α_2) from which the mixed memberships {π_1i, i = 1, ..., N} for rows and {π_2j, j = 1, ..., M} for columns are generated respectively.

Figure 6.1: The graphical model for RBC.

Row clusters for entries in row i and column clusters for entries in column j are sampled from discrete distributions discrete(π_1i) and discrete(π_2j) respectively. A row cluster g and a column cluster h together determine a co-cluster (g, h), which has a Gaussian distribution N(μ_gh, σ²_gh), where μ_gh and σ²_gh are the mean and variance of co-cluster (g, h). Note that in principle it is possible to generalize the model by using exponential family distributions as in BCC, but we only discuss the Gaussian case in this chapter. Assuming the mean of each row i is m_1i and the mean of each column j is m_2j, the generative model for RBC is as follows:

1. For each row i, i = 1, ..., N, choose π_1i ~ Dirichlet(α_1).
2. For each column j, j = 1, ..., M, choose π_2j ~ Dirichlet(α_2).
3. To generate a non-missing entry in row i and column j,
   (a) choose z_1 = g ~ discrete(π_1i) and z_2 = h ~ discrete(π_2j),
   (b) choose x_ij ~ N(x | μ_{z_1 z_2} + b m_1i + b m_2j, σ²_{z_1 z_2}).

The graphical model for RBC is shown in Figure 6.1, where S is the total number of non-missing entries. The model only generates the non-missing entries, so it can handle matrices with missing values naturally. The generative process defines each entry x_ij to be generated from a Gaussian distribution N(x | μ_{z_1 z_2} + b m_1i + b m_2j, σ²_{z_1 z_2}), which means that the co-cluster mean μ_{z_1 z_2} in RBC is actually defined on the matrix after subtracting the effects of the row and column means, rather than on the original matrix.
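To make the generative process concrete, the following sketch samples a small synthetic matrix from the RBC model as stated above (Gaussian case). The row/column means, cluster sizes, and observation fraction are illustrative choices, not values used in the thesis.

```python
import numpy as np

def sample_rbc(N, M, k1, k2, alpha1, alpha2, mu, sigma2, b, m1, m2,
               obs_frac=0.3, seed=0):
    """Sample the non-missing entries of an N x M matrix from the RBC model."""
    rng = np.random.default_rng(seed)
    pi1 = rng.dirichlet(alpha1, size=N)          # row mixed memberships
    pi2 = rng.dirichlet(alpha2, size=M)          # column mixed memberships
    delta = rng.random((N, M)) < obs_frac        # which entries are observed
    X = np.full((N, M), np.nan)
    for i, j in zip(*np.nonzero(delta)):
        g = rng.choice(k1, p=pi1[i])             # row cluster for this entry
        h = rng.choice(k2, p=pi2[j])             # column cluster for this entry
        mean = mu[g, h] + b * m1[i] + b * m2[j]  # co-cluster mean plus biases
        X[i, j] = rng.normal(mean, np.sqrt(sigma2[g, h]))
    return X, delta

# Toy usage
k1, k2 = 3, 2
X, delta = sample_rbc(N=20, M=15, k1=k1, k2=k2,
                      alpha1=np.ones(k1), alpha2=np.ones(k2),
                      mu=np.arange(k1 * k2, dtype=float).reshape(k1, k2),
                      sigma2=0.1 * np.ones((k1, k2)), b=1.0,
                      m1=np.zeros(20), m2=np.zeros(15))
print(np.nanmean(X))
```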

b is a parameter to be estimated that determines how strong the row and column effects are. In RBC, we assume equal row and column effects by using the same coefficient b for both m_1i and m_2j, but in principle we could differentiate the two effects by introducing different coefficients. For each entry x_ij in the data matrix X, given row cluster z_1ij ∈ {1, ..., k_1} and column cluster z_2ij ∈ {1, ..., k_2}, the probability of x_ij is defined as

p(x_{ij} | \theta_{z_{1ij} z_{2ij}}, b) = p(x_{ij} | \mu_{z_{1ij} z_{2ij}} + b m_{1i} + b m_{2j}, \sigma^2_{z_{1ij} z_{2ij}}),

where θ_{z_{1ij} z_{2ij}} = {μ_{z_{1ij} z_{2ij}}, σ²_{z_{1ij} z_{2ij}}}. The generation of x_ij depends not only on b and θ_{z_{1ij} z_{2ij}} but also on m_1i and m_2j, so strictly the probability should be denoted p(x_ij | θ_{z_{1ij} z_{2ij}}, b, m_1i, m_2j); we omit m_1i and m_2j hereafter for brevity. Accordingly, the marginal probability of x_ij is given by

p(x_{ij} | \alpha_1, \alpha_2, b, \Theta) = \int_{\pi_{1i}} \int_{\pi_{2j}} p(\pi_{1i} | \alpha_1) p(\pi_{2j} | \alpha_2) \sum_{z_{1ij}} \sum_{z_{2ij}} p(z_{1ij} | \pi_{1i}) p(z_{2ij} | \pi_{2j}) p(x_{ij} | \theta_{z_{1ij} z_{2ij}}, b) \, d\pi_{1i} \, d\pi_{2j},

where Θ = {θ_gh, g = 1..k_1, h = 1..k_2} = {μ_gh, σ²_gh, g = 1..k_1, h = 1..k_2}. The probability of the entire matrix X is not the product of all such marginal probabilities, because π_1 for each row and π_2 for each column are sampled only once for all entries in that row or column, so the entries in the same row or the same column are not statistically independent. The overall joint distribution over all observable and latent variables is given by

p(X, \{\pi_{1i}\}, \{\pi_{2j}\}, \{z_{1ij}\}, \{z_{2ij}\} | \alpha_1, \alpha_2, b, \Theta) = \prod_i p(\pi_{1i} | \alpha_1) \prod_j p(\pi_{2j} | \alpha_2) \prod_{i,j} \big( p(z_{1ij} | \pi_{1i}) p(z_{2ij} | \pi_{2j}) p(x_{ij} | \theta_{z_{1ij} z_{2ij}}, b) \big)^{\delta_{ij}},

where δ_ij is an indicator function that takes value 0 when x_ij is missing and 1 otherwise, so only the non-missing entries are considered. Marginalizing over {z_1ij, z_2ij, i = 1..N, j = 1..M} and {π_1i, π_2j, i = 1..N, j = 1..M}, the probability of observing the entire matrix X is

p(X | \alpha_1, \alpha_2, b, \Theta) = \int \Big( \prod_{i=1}^{N} p(\pi_{1i} | \alpha_1) \Big) \Big( \prod_{j=1}^{M} p(\pi_{2j} | \alpha_2) \Big) \prod_{i,j} \Big( \sum_{z_{1ij}} \sum_{z_{2ij}} p(z_{1ij} | \pi_{1i}) p(z_{2ij} | \pi_{2j}) p(x_{ij} | \theta_{z_{1ij} z_{2ij}}, b) \Big)^{\delta_{ij}} d\pi_{11} \cdots d\pi_{1N} \, d\pi_{21} \cdots d\pi_{2M}.    (6.1)

6.2 Inference and Learning

Given the data matrix X, the learning task is to estimate the model parameters (α_1*, α_2*, b*, Θ*) such that the likelihood of observing the matrix X is maximized. As in BCC, the computation of log p(X | α_1, α_2, b, Θ) is intractable, so we follow the variational inference strategy used in Chapter 5.

6.2.1 Variational Inference

To get a tractable lower bound for log p(X | α_1, α_2, b, Θ), we introduce the same variational distribution q(z_1, z_2, π_1, π_2 | γ_1, γ_2, φ_1, φ_2) as in (5.3) to serve as an approximation of the latent variable distribution p(z_1, z_2, π_1, π_2 | α_1, α_2, b, Θ):

q(z_1, z_2, \pi_1, \pi_2 | \gamma_1, \gamma_2, \phi_1, \phi_2) = \Big( \prod_{i=1}^{N} q(\pi_{1i} | \gamma_{1i}) \Big) \Big( \prod_{j=1}^{M} q(\pi_{2j} | \gamma_{2j}) \Big) \prod_{i=1}^{N} \prod_{j=1}^{M} q(z_{1ij} | \phi_{1i}) \, q(z_{2ij} | \phi_{2j}),

where γ_1 = {γ_1i, i = 1..N} and γ_2 = {γ_2j, j = 1..M} are the parameters of the variational Dirichlet distributions for rows and columns, and φ_1 = {φ_1i, i = 1..N} and φ_2 = {φ_2j, j = 1..M} are the parameters of the variational discrete distributions for rows and columns. After introducing the variational distribution q, a direct application of Jensen's inequality [94] gives a lower bound on log p(X | α_1, α_2, b, Θ):

log p(X | \alpha_1, \alpha_2, b, \Theta) \geq E_q[\log p(X, z_1, z_2, \pi_1, \pi_2 | \alpha_1, \alpha_2, b, \Theta)] - E_q[\log q(z_1, z_2, \pi_1, \pi_2 | \gamma_1, \gamma_2, \phi_1, \phi_2)]
= E_q[\log p(\pi_1 | \alpha_1)] + E_q[\log p(\pi_2 | \alpha_2)] + E_q[\log p(z_1 | \pi_1)] + E_q[\log p(z_2 | \pi_2)] + E_q[\log p(X | z_1, z_2, b, \Theta)] - E_q[\log q(\pi_1 | \gamma_1)] - E_q[\log q(\pi_2 | \gamma_2)] - E_q[\log q(z_1 | \phi_1)] - E_q[\log q(z_2 | \phi_2)].    (6.2)

Denote the lower bound function by L(γ_1, γ_2, φ_1, φ_2; α_1, α_2, b, Θ). The first eight terms of L are the same as the expressions in Table 5.1, and the last term of (6.2) is given by

E_q[\log p(X | z_1, z_2, b, \Theta)] = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \delta_{ij} \phi_{1ig} \phi_{2jh} \Big( -\frac{(x_{ij} - \mu_{gh} - b m_{1i} - b m_{2j})^2}{2\sigma_{gh}^2} - \frac{1}{2}\log 2\pi\sigma_{gh}^2 \Big).

Given (α_1, α_2, b, Θ), maximizing the lower bound function L(γ_1, γ_2, φ_1, φ_2; α_1, α_2, b, Θ) with respect to the variational parameters (γ_1, γ_2, φ_1, φ_2) gives the update equations for the variational parameters:

\phi_{1ig} \propto \exp\Big( \Psi(\gamma_{1ig}) - \frac{1}{w_{1i}} \sum_{j,h} \delta_{ij} \phi_{2jh} \big( (x_{ij} - \mu_{gh} - b m_{1i} - b m_{2j})^2 / 2\sigma_{gh}^2 + \log\sigma_{gh} \big) \Big)    (6.3)

\phi_{2jh} \propto \exp\Big( \Psi(\gamma_{2jh}) - \frac{1}{w_{2j}} \sum_{i,g} \delta_{ij} \phi_{1ig} \big( (x_{ij} - \mu_{gh} - b m_{1i} - b m_{2j})^2 / 2\sigma_{gh}^2 + \log\sigma_{gh} \big) \Big)    (6.4)

\gamma_{1ig} = \alpha_{1g} + w_{1i} \phi_{1ig}    (6.5)

\gamma_{2jh} = \alpha_{2h} + w_{2j} \phi_{2jh},    g = 1..k_1, h = 1..k_2, i = 1..N, j = 1..M.    (6.6)

The solution is not in closed form: φ_1 and φ_2 depend on each other, and γ_1 and γ_2 depend on φ_1 and φ_2 respectively.

Maximizing the lower bound function L(γ_1, γ_2, φ_1, φ_2; α_1, α_2, b, Θ) with respect to the model parameters (α_1, α_2, b, Θ) gives the update equations for the model parameters. In particular, the update equations for α_1 and α_2 are the same as in (5.9) and (5.10), i.e.,

\alpha_1 \leftarrow \alpha_1 + \eta H(\alpha_1)^{-1} g(\alpha_1)    (6.7)

\alpha_2 \leftarrow \alpha_2 + \eta H(\alpha_2)^{-1} g(\alpha_2),    (6.8)

where H(·) and g(·) are the Hessian matrix and gradient of the lower bound function L at α_1 and α_2, and η is a line-search parameter. The update equations for Θ = {μ_gh, σ²_gh, g = 1..k_1, h = 1..k_2} and b are given by

\mu_{gh} = \frac{\sum_{i,j} \delta_{ij} \phi_{1ig} \phi_{2jh} (x_{ij} - b m_{1i} - b m_{2j})}{\sum_{i,j} \delta_{ij} \phi_{1ig} \phi_{2jh}}    (6.9)

\sigma_{gh}^2 = \frac{\sum_{i,j} \delta_{ij} \phi_{1ig} \phi_{2jh} (x_{ij} - b m_{1i} - b m_{2j} - \mu_{gh})^2}{\sum_{i,j} \delta_{ij} \phi_{1ig} \phi_{2jh}}    (6.10)

b = \frac{\sum_{i,j,g,h} \delta_{ij} \phi_{1ig} \phi_{2jh} (x_{ij} - \mu_{gh})(m_{1i} + m_{2j})}{\sum_{i,j,g,h} \delta_{ij} \phi_{1ig} \phi_{2jh} (m_{1i} + m_{2j})^2},    g = 1..k_1, h = 1..k_2.    (6.11)

A variational EM algorithm then updates the variational parameters (γ_1, γ_2, φ_1, φ_2) in the E-step and the model parameters (α_1, α_2, b, Θ) in the M-step iteratively until convergence.

6.2.2 Prediction

The prediction of missing entries in existent rows and columns is straightforward. After running variational EM, we obtain not only the model parameters (α_1*, α_2*, b*, Θ*) but also the variational parameters (φ_1*, φ_2*, γ_1*, γ_2*), where φ_1i* = [φ_1i1*, φ_1i2*, ..., φ_1ik_1*]^T gives the mixed membership of each row i in each row cluster g, g = 1..k_1. Similarly, φ_2j* gives the mixed membership of each column j in each column cluster h, h = 1..k_2. Therefore, φ_1ig* φ_2jh* gives the mixed membership of x_ij in each co-cluster (g, h). The prediction of each entry r_ij of the residual matrix R is given by r̂_ij = Σ_{g,h} φ_1ig* φ_2jh* μ_gh, which is the solution of [12]

\arg\min_{\hat r_{ij}} E_{(g,h) \sim \phi^*_{1ig} \phi^*_{2jh}} \big[ \| \mu_{gh} - \hat r_{ij} \|^2 \big].

Therefore, after adjusting for the effect of the row and column means, the prediction of each entry x̂_ij is

\hat x_{ij} = \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \phi^*_{1ig} \phi^*_{2jh} \mu_{gh} + b^* m_{1i} + b^* m_{2j}.    (6.12)

The whole matrix can be approximated using

\hat X = G_1^T \, \Xi \, G_2 + b^* m_1 e_1^T + b^* e_2 m_2^T,    (6.13)

where G_1 ∈ R^{k_1 × N} and G_2 ∈ R^{k_2 × M} are matrices with φ_1i* and φ_2j* in their columns respectively, Ξ ∈ R^{k_1 × k_2} has μ_gh as its (g, h)-th entry, m_1 = [m_11, ..., m_1N]^T, m_2 = [m_21, ..., m_2M]^T, and e_1 ∈ R^M and e_2 ∈ R^N are all-ones vectors.

RBC is also able to predict missing entries in new rows or columns that were not used in the training process. When new users come into the recommendation system and give two or three ratings, RBC is able to predict the rest of the ratings for the new users without retraining the model. In particular, given an N × M data matrix X, we first run RBC on it to get the model parameters (α_1*, α_2*, b*, Θ*). For r newly arriving rows with a few non-missing entries, appending them to X yields an (N + r) × M matrix Z. We run the E-step on Z given (α_1*, α_2*, b*, Θ*) to get (φ_1, φ_2, γ_1, γ_2) for Z. The prediction can then be performed following (6.12) for the missing entries in the new rows. The algorithm is similarly applicable to new columns.
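A sketch of the matrix-form prediction rule (6.13), assuming the variational and model parameters have already been estimated; all arrays below are illustrative placeholders.

```python
import numpy as np

def rbc_predict_matrix(phi1, phi2, Xi, b, m1, m2):
    """X_hat = G1^T Xi G2 + b m1 e1^T + b e2 m2^T, as in (6.13).

    phi1: N x k1 row mixed memberships, phi2: M x k2 column mixed memberships,
    Xi: k1 x k2 matrix of co-cluster means mu_gh, m1/m2: row/column means.
    """
    N, M = phi1.shape[0], phi2.shape[0]
    Xhat = phi1 @ Xi @ phi2.T                    # sum_gh phi1_ig phi2_jh mu_gh
    Xhat = Xhat + b * m1[:, None] * np.ones((1, M))   # b * m1 e1^T
    Xhat = Xhat + b * np.ones((N, 1)) * m2[None, :]   # b * e2 m2^T
    return Xhat

# Toy usage
rng = np.random.default_rng(0)
phi1 = rng.dirichlet(np.ones(3), size=6)     # N = 6, k1 = 3
phi2 = rng.dirichlet(np.ones(2), size=4)     # M = 4, k2 = 2
Xi = rng.normal(size=(3, 2))
print(rbc_predict_matrix(phi1, phi2, Xi, b=0.5,
                         m1=rng.normal(size=6), m2=rng.normal(size=4)).shape)
```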

Figure 6.2: The variational distribution q': (a) row, (b) column.

6.3 Parallel RBC

For very large matrices with millions of rows and columns, it is desirable to run missing value prediction in a distributed and parallel way for high efficiency. In this section, we propose parallel RBC, which is applicable to large-scale matrices.

6.3.1 Fully Factorized Variational Distribution

The update expressions for φ_1 and φ_2 in (6.3) and (6.4) show that the update of φ_1i for each row i depends on φ_2j for all columns j = 1..M, and the update of φ_2j for each column j depends on φ_1i for all rows i = 1..N. This dependency makes it difficult to parallelize RBC. Therefore, before proposing parallel RBC, we first propose a new variational inference algorithm using a fully factorized variational distribution. RBC with this new inference algorithm is referred to as RBC-FF.

Given the objective function in (6.1), to get a tractable lower bound we previously approximated the latent variable distribution p(z_1, z_2, π_1, π_2 | α_1, α_2, b, Θ, X) using the variational distribution q(z_1, z_2, π_1, π_2 | γ_1, γ_2, φ_1, φ_2), in which all the entries in the same row i (column j) share the same discrete distribution with variational parameter φ_1i ∈ R^{k_1} (φ_2j ∈ R^{k_2}). In this section, we propose a new variational distribution q' as a family of fully factorized distributions:

q'(z_1, z_2, \pi_1, \pi_2 | \gamma_1, \gamma_2, \phi'_1, \phi'_2) = \Big( \prod_{i=1}^{N} q'(\pi_{1i} | \gamma_{1i}) \Big) \Big( \prod_{j=1}^{M} q'(\pi_{2j} | \gamma_{2j}) \Big) \prod_{i=1}^{N} \prod_{j=1}^{M} q'(z_{1ij} | \phi'_{1ij}) \, q'(z_{2ij} | \phi'_{2ij}),

where φ'_1 = {φ'_1ij, i = 1..N, j = 1..M} and φ'_2 = {φ'_2ij, i = 1..N, j = 1..M}. The graphical model of q' is shown in Figure 6.2. Instead of sharing a discrete distribution with the other entries in the same row (column), each entry x_ij has its own discrete(φ'_1ij) and discrete(φ'_2ij) for z_1 and z_2 respectively. The disadvantage of using q' instead of q is that the number of parameters for the variational discrete distributions increases from (N + M) to 2S, where S = NM for a full matrix. However, as we will show, using q' reduces the dependency between rows and columns in the inference step and hence facilitates parallel RBC.

Using q', each type of term in the lower bound function L(γ_1, γ_2, φ'_1, φ'_2; α_1, α_2, b, Θ) is given in Table 6.1.

Table 6.1: Expressions for the terms in L(γ_1, γ_2, φ'_1, φ'_2; α_1, α_2, b, Θ) using q'.

E_{q'}[\log p(\pi_1 | \alpha_1)] = \sum_{i=1}^{N} \sum_{g=1}^{k_1} (\alpha_{1g}-1)\big(\Psi(\gamma_{1ig}) - \Psi(\sum_{l=1}^{k_1}\gamma_{1il})\big) + N \log\Gamma(\sum_{g=1}^{k_1}\alpha_{1g}) - N \sum_{g=1}^{k_1}\log\Gamma(\alpha_{1g})

E_{q'}[\log p(z_1 | \pi_1)] = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{g=1}^{k_1} \delta_{ij} \phi'_{1ijg} \big(\Psi(\gamma_{1ig}) - \Psi(\sum_{l=1}^{k_1}\gamma_{1il})\big)

E_{q'}[\log q'(\pi_1 | \gamma_1)] = \sum_{i=1}^{N} \sum_{g=1}^{k_1} (\gamma_{1ig}-1)\big(\Psi(\gamma_{1ig}) - \Psi(\sum_{l=1}^{k_1}\gamma_{1il})\big) + \sum_{i=1}^{N}\log\Gamma(\sum_{g=1}^{k_1}\gamma_{1ig}) - \sum_{i=1}^{N}\sum_{g=1}^{k_1}\log\Gamma(\gamma_{1ig})

E_{q'}[\log q'(z_1 | \phi'_1)] = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \delta_{ij} \phi'_{1ijg} \phi'_{2ijh} \log\phi'_{1ijg}

E_{q'}[\log p(X | z_1, z_2, b, \Theta)] = \sum_{i=1}^{N} \sum_{j=1}^{M} \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \delta_{ij} \phi'_{1ijg} \phi'_{2ijh} \Big( -\frac{(x_{ij}-\mu_{gh}-b m_{1i}-b m_{2j})^2}{2\sigma_{gh}^2} - \frac{1}{2}\log 2\pi\sigma_{gh}^2 \Big)

The summation of all terms in L has a coupling of the form φ'_1ijg φ'_2ijh. Letting Φ_ij(g, h) = φ'_1ijg φ'_2ijh, Φ_ij ∈ R^{k_1 × k_2} denotes the probability of x_ij belonging to each co-cluster (g, h), and Φ = {Φ_ij, i = 1..N, j = 1..M}. In the inference step, given (α_1, α_2, b, Θ), the update equations for the variational parameters are as follows:

\Phi_{ij}(g, h) \propto \exp\big( \Psi(\gamma_{1ig}) + \Psi(\gamma_{2jh}) - (x_{ij} - \mu_{gh} - b m_{1i} - b m_{2j})^2 / 2\sigma_{gh}^2 - \log\sigma_{gh} \big)    (6.14)

\gamma_{1ig} = \alpha_{1g} + \sum_{j=1}^{M} \sum_{h=1}^{k_2} \delta_{ij} \Phi_{ij}(g, h)    (6.15)

\gamma_{2jh} = \alpha_{2h} + \sum_{i=1}^{N} \sum_{g=1}^{k_1} \delta_{ij} \Phi_{ij}(g, h),    g = 1..k_1, h = 1..k_2, i = 1..N, j = 1..M.    (6.16)

In parameter estimation, the updates for α_1 and α_2 are the same as in (6.7) and

(6.8), and the updates for Θ = {μ_gh, σ²_gh, g = 1..k_1, h = 1..k_2} and b are given by

\mu_{gh} = \frac{\sum_{i,j} \delta_{ij} \Phi_{ij}(g,h) (x_{ij} - b m_{1i} - b m_{2j})}{\sum_{i,j} \delta_{ij} \Phi_{ij}(g,h)}    (6.17)

\sigma_{gh}^2 = \frac{\sum_{i,j} \delta_{ij} \Phi_{ij}(g,h) (x_{ij} - b m_{1i} - b m_{2j} - \mu_{gh})^2}{\sum_{i,j} \delta_{ij} \Phi_{ij}(g,h)}    (6.18)

b = \frac{\sum_{i,j,g,h} \delta_{ij} \Phi_{ij}(g,h) (x_{ij} - \mu_{gh})(m_{1i} + m_{2j})}{\sum_{i,j,g,h} \delta_{ij} \Phi_{ij}(g,h)(m_{1i} + m_{2j})^2},    g = 1..k_1, h = 1..k_2.    (6.19)

We have a variational EM algorithm similar to the one in Section 6.2.1. In particular, the algorithm alternates between the E-step, updating the variational parameters (γ_1, γ_2, Φ), and the M-step, updating the model parameters (α_1, α_2, b, Θ), until convergence. The objective function is guaranteed to be non-decreasing.

6.3.2 Prediction

For each entry x_ij, since Φ_ij(g, h) gives its mixed membership in each co-cluster (g, h), Σ_{h=1}^{k_2} Φ_ij(g, h) and Σ_{g=1}^{k_1} Φ_ij(g, h) give its mixed membership in row cluster g and column cluster h respectively. We take the average of the row mixed memberships of all entries in the same row i, i.e., (1/w_1i) Σ_{j,h} δ_ij Φ_ij(g, h), as the mixed membership over row clusters for row i. Similarly, we use (1/w_2j) Σ_{i,g} δ_ij Φ_ij(g, h) as the mixed membership over column clusters for column j. Therefore, the prediction of x_ij in an existent row and column is given by

\hat x_{ij} = \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \Big( \frac{1}{w_{1i}} \sum_{j'=1}^{M} \sum_{h'=1}^{k_2} \delta_{ij'} \Phi_{ij'}(g, h') \Big) \Big( \frac{1}{w_{2j}} \sum_{i'=1}^{N} \sum_{g'=1}^{k_1} \delta_{i'j} \Phi_{i'j}(g', h) \Big) \mu_{gh} + b^* m_{1i} + b^* m_{2j}.    (6.20)

We still use (6.13) to predict the whole matrix, except that G_1 and G_2 now have (1/w_1i) Σ_{j,h} δ_ij Φ_ij(g, h) and (1/w_2j) Σ_{i,g} δ_ij Φ_ij(g, h) in their columns respectively. To predict the entries in a new row, we follow a strategy similar to the one in Section 6.2.2: given an N × M data matrix X, we first run RBC-FF on it to get the model parameters (α_1*, α_2*, b*, Θ*), then append the r newly arriving rows to X to get an (N + r) × M matrix Z. Given the model parameters, an E-step on Z yields (Φ, γ_1, γ_2), which are used in (6.20) to predict the missing entries in the new rows. The algorithm is also applicable to new columns.
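The fully factorized E-step (6.14)-(6.16) can be written compactly with numpy broadcasting. The sketch below is one possible implementation under the notation above (digamma comes from scipy; variable names, shapes, and the toy call are illustrative, not the thesis's code).

```python
import numpy as np
from scipy.special import digamma

def rbcff_e_step(X, delta, alpha1, alpha2, gamma1, gamma2, mu, sigma2, b, m1, m2):
    """One pass of (6.14)-(6.16) over all observed entries.

    gamma1: N x k1, gamma2: M x k2, mu/sigma2: k1 x k2 co-cluster parameters.
    Returns per-entry memberships Phi (S x k1 x k2) and updated gamma1, gamma2.
    """
    rows, cols = np.nonzero(delta)                       # the S observed entries
    resid = (X[rows, cols][:, None, None] - mu[None]
             - (b * (m1[rows] + m2[cols]))[:, None, None])
    log_phi = (digamma(gamma1)[rows][:, :, None] + digamma(gamma2)[cols][:, None, :]
               - resid ** 2 / (2.0 * sigma2[None]) - 0.5 * np.log(sigma2[None]))
    log_phi -= log_phi.max(axis=(1, 2), keepdims=True)   # numerical stability
    Phi = np.exp(log_phi)
    Phi /= Phi.sum(axis=(1, 2), keepdims=True)           # (6.14), normalized over (g, h)

    new_gamma1 = np.tile(alpha1, (gamma1.shape[0], 1))
    new_gamma2 = np.tile(alpha2, (gamma2.shape[0], 1))
    np.add.at(new_gamma1, rows, Phi.sum(axis=2))         # (6.15)
    np.add.at(new_gamma2, cols, Phi.sum(axis=1))         # (6.16)
    return Phi, new_gamma1, new_gamma2

# Toy call: 5 x 4 matrix, k1 = 2 row clusters, k2 = 3 column clusters
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4)); delta = rng.random((5, 4)) < 0.5
Phi, g1, g2 = rbcff_e_step(X, delta, np.ones(2), np.ones(3),
                           np.ones((5, 2)), np.ones((4, 3)),
                           mu=np.zeros((2, 3)), sigma2=np.ones((2, 3)),
                           b=0.0, m1=np.zeros(5), m2=np.zeros(4))
```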

6.3.3 Parallel RBC

We parallelize RBC based on RBC-FF. It employs a client-server mode, where multiple clients run parts of the algorithm in parallel and pass intermediate results to the server, which then gathers all the intermediate results to do the remaining computations. Since we are running an EM algorithm, the process continues for several iterations. For brevity, we call the parallelized algorithm parallel RBC instead of parallel RBC-FF.

In the E-step, the data matrix is divided into several parts and each part is assigned to one client, which calculates Φ_ij(g, h) only for the entries assigned to it. The division and assignment of the data matrix can be arbitrary, as long as each client knows the indices (i, j) of its entries in the original matrix. In the extreme case, each client could hold only one entry. Without loss of generality, we assume that the original data matrix X is divided into d parts and each client c has access to one submatrix Y_c, c = 1..d. In addition, each client also needs the model parameters α_1, α_2, b, and Θ, as well as {m_1i, m_2j, i = 1..N, j = 1..M}. Given γ_1 and γ_2, each client c can calculate Φ_c = {Φ_ij(g, h), x_ij ∈ Y_c} for all the entries assigned to it using (6.14). The client then calculates the summations

\Phi^c_{1ig} = \sum_{j: x_{ij} \in Y_c} \sum_{h=1}^{k_2} \delta_{ij} \Phi^c_{ij}(g, h)    (6.21)

\Phi^c_{2jh} = \sum_{i: x_{ij} \in Y_c} \sum_{g=1}^{k_1} \delta_{ij} \Phi^c_{ij}(g, h),    (6.22)

for g = 1..k_1, h = 1..k_2, i = 1..N, j = 1..M. The clients pass these summations to the server. The server updates γ_1 and γ_2 using the intermediate summations without accessing Φ_ij(g, h):

\gamma_{1ig} = \alpha_{1g} + \sum_{c=1}^{d} \Phi^c_{1ig}    (6.23)

\gamma_{2jh} = \alpha_{2h} + \sum_{c=1}^{d} \Phi^c_{2jh},    (6.24)

which are equivalent to (6.15) and (6.16) respectively. The updated γ_1 and γ_2 are passed back to the clients for further updates of Φ_ij(g, h). A more straightforward option to parallelize the E-step would be to calculate Φ_c on each client and pass all of {Φ_c, c = 1..d} to the server for updating γ_1 and γ_2. However, it has two limitations compared to our

strategy: first, for very large matrices with millions of entries, the number of Φ_ij(g, h) parameters is huge, so passing them from the clients to the server over several iterations incurs a huge communication cost; second, the computation of γ_1 and γ_2 cannot be parallelized. In our algorithm, we perform most of the computation (both E-step and M-step) in parallel, and avoid passing the huge number of parameters Φ_ij(g, h) back and forth.

In the M-step, the updating of α_1 and α_2 is performed on the server side following (6.7) and (6.8). For the updating of b and Θ as in (6.17)-(6.19), there are dependencies among the parameters. However, a closer examination gives the following observation: {μ_gh, g = 1..k_1, h = 1..k_2} and b depend on each other, but they do not depend on {σ²_gh, g = 1..k_1, h = 1..k_2}. Therefore, we can run (6.17) and (6.19) iteratively until convergence to get {μ_gh} and b, which are then used to calculate {σ²_gh} using (6.18) in one step. We expand (6.17)-(6.19) to get the following expressions:

\mu_{gh} = \frac{\sum_{i,j} \delta_{ij}\Phi_{ij}(g,h) x_{ij} - b \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)(m_{1i}+m_{2j})}{\sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)}

\sigma_{gh}^2 = \Big( \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h) x_{ij}^2 + b^2 \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)(m_{1i}+m_{2j})^2 + \mu_{gh}^2 \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h) - 2b \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h) x_{ij}(m_{1i}+m_{2j}) - 2\mu_{gh} \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h) x_{ij} + 2b\mu_{gh} \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)(m_{1i}+m_{2j}) \Big) \Big/ \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)

b = \frac{\sum_{g,h} \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h) x_{ij}(m_{1i}+m_{2j}) - \sum_{g,h} \mu_{gh} \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)(m_{1i}+m_{2j})}{\sum_{g,h} \sum_{i,j} \delta_{ij}\Phi_{ij}(g,h)(m_{1i}+m_{2j})^2}

Letting

A^c_{gh} = \sum_{x_{ij} \in Y_c} \delta_{ij} \Phi^c_{ij}(g, h) x_{ij}    (6.25)

B^c_{gh} = \sum_{x_{ij} \in Y_c} \delta_{ij} \Phi^c_{ij}(g, h)(m_{1i} + m_{2j})    (6.26)

C^c_{gh} = \sum_{x_{ij} \in Y_c} \delta_{ij} \Phi^c_{ij}(g, h)    (6.27)

D^c_{gh} = \sum_{x_{ij} \in Y_c} \delta_{ij} \Phi^c_{ij}(g, h) x_{ij}^2    (6.28)

E^c_{gh} = \sum_{x_{ij} \in Y_c} \delta_{ij} \Phi^c_{ij}(g, h)(m_{1i} + m_{2j})^2    (6.29)

F^c_{gh} = \sum_{x_{ij} \in Y_c} \delta_{ij} \Phi^c_{ij}(g, h) x_{ij}(m_{1i} + m_{2j}),    (6.30)

we have

\mu_{gh} = \frac{\sum_c A^c_{gh} - b \sum_c B^c_{gh}}{\sum_c C^c_{gh}}    (6.31)

\sigma_{gh}^2 = \Big( \sum_c D^c_{gh} + b^2 \sum_c E^c_{gh} + \mu_{gh}^2 \sum_c C^c_{gh} - 2b \sum_c F^c_{gh} - 2\mu_{gh} \sum_c A^c_{gh} + 2b\mu_{gh} \sum_c B^c_{gh} \Big) \Big/ \sum_c C^c_{gh}    (6.32)

b = \frac{\sum_{g,h} \sum_c F^c_{gh} - \sum_{g,h} \mu_{gh} \sum_c B^c_{gh}}{\sum_{g,h} \sum_c E^c_{gh}}.    (6.33)

The M-step is performed on the clients and the server as follows: first, each client c calculates {A^c_gh, B^c_gh, C^c_gh, D^c_gh, E^c_gh, F^c_gh, g = 1..k_1, h = 1..k_2} and passes them to the server. The server alternates between (6.31) and (6.33) until convergence to get the updated {μ_gh} and b; it then updates {σ²_gh} using (6.32). Organizing the update equations as in (6.31)-(6.33) has two advantages: first, most of the work (A^c ... F^c) is computed in parallel on the clients and only once per M-step, so the computation on the server is very small; second, the updates depend only on (A^c ... F^c), which are six k_1 × k_2 matrices, instead of on Φ, which contains S such matrices. In real data matrices, where S ≫ 6, the communication cost of passing (A^c ... F^c) is orders of magnitude smaller than that of passing Φ.

Putting everything together, to run parallel RBC the server needs access to the whole data matrix X, and each client c needs access to a submatrix Y_c with its indices (i, j) in X, initial values for the model parameters (α_1, α_2, b, Θ), and common initial values of the variational parameters γ_1 and γ_2 to start the E-step. The algorithm runs as follows:

1. E-step:
   (a) Each client c calculates Φ_c for each x_ij ∈ Y_c using (6.14).
   (b) Each client c calculates {Φ^c_1ig, g = 1..k_1} and {Φ^c_2jh, h = 1..k_2} following (6.21) and (6.22), and passes them to the server.
   (c) The server updates γ_1 and γ_2 following (6.23) and (6.24), and passes them to the clients.
   (d) Go back to (a) until convergence.

2. M-step:
   (a) Each client c calculates (A^c ... F^c) for g = 1..k_1, h = 1..k_2 and passes them to the server.
   (b) The server alternates between (6.31) and (6.33) until convergence to get the updated {μ_gh, g = 1..k_1, h = 1..k_2} and b, then updates {σ²_gh, g = 1..k_1, h = 1..k_2} following (6.32). The server passes b and Θ to the clients.
   (c) The server updates α_1 and α_2 following (6.7) and (6.8) until convergence and passes them to the clients.

3. Go back to the E-step until convergence.

For the prediction of entries in existent rows and columns, since the server has the final {Φ^c_1ig, g = 1..k_1} and {Φ^c_2jh, h = 1..k_2} passed from the clients, it can use

\hat x_{ij} = \sum_{g=1}^{k_1} \sum_{h=1}^{k_2} \Big( \frac{1}{w_{1i}} \sum_{c=1}^{d} \Phi^c_{1ig} \Big) \Big( \frac{1}{w_{2j}} \sum_{c=1}^{d} \Phi^c_{2jh} \Big) \mu_{gh} + b^* m_{1i} + b^* m_{2j}    (6.34)

to do prediction, which is equivalent to (6.20). For the prediction of entries in new rows, the non-missing entries in the new rows are assigned to the clients, together with the updated row and column means. The clients run a parallel E-step to get new {Φ^c_1ig, g = 1..k_1} and {Φ^c_2jh, h = 1..k_2} and pass them to the server. The server then predicts the missing entries using (6.34).
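Before turning to the experiments, the following sketch makes the client/server split of the M-step concrete: each client computes the sufficient statistics (6.25)-(6.30) for its own entries, and the server aggregates them and alternates (6.31) and (6.33) before computing (6.32). It is a single-process simulation of the protocol above with illustrative function and variable names.

```python
import numpy as np

def client_stats(xs, m1s, m2s, Phi):
    """Per-client sufficient statistics (6.25)-(6.30).

    xs, m1s, m2s: length-S_c arrays of x_ij, m_1i, m_2j for the client's entries.
    Phi: S_c x k1 x k2 per-entry co-cluster memberships.
    """
    m = (m1s + m2s)[:, None, None]
    x = xs[:, None, None]
    A = (Phi * x).sum(0);      B = (Phi * m).sum(0);      C = Phi.sum(0)
    D = (Phi * x ** 2).sum(0); E = (Phi * m ** 2).sum(0); F = (Phi * x * m).sum(0)
    return A, B, C, D, E, F

def server_update(stats_from_clients, n_iters=50):
    """Aggregate client statistics, then alternate (6.31)/(6.33) and apply (6.32)."""
    A, B, C, D, E, F = (sum(s[k] for s in stats_from_clients) for k in range(6))
    C = np.maximum(C, 1e-12)                 # guard against empty co-clusters
    b = 0.0
    for _ in range(n_iters):
        mu = (A - b * B) / C                                   # (6.31)
        b = (F.sum() - (mu * B).sum()) / E.sum()               # (6.33)
    sigma2 = (D + b**2 * E + mu**2 * C
              - 2*b*F - 2*mu*A + 2*b*mu*B) / C                 # (6.32)
    return mu, sigma2, b

# Toy usage with two simulated clients holding 40 and 60 entries (k1=2, k2=3).
rng = np.random.default_rng(0)
def fake_client(n):
    Phi = rng.dirichlet(np.ones(6), size=n).reshape(n, 2, 3)
    return client_stats(rng.normal(size=n), rng.normal(size=n),
                        rng.normal(size=n), Phi)
print(server_update([fake_client(40), fake_client(60)])[2])   # estimated b
```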

6.4 Experimental Results

In this section, we present experimental results on missing value prediction for RBC and RBC-FF and on running time for parallel RBC, compared to other algorithms. Missing value prediction includes the prediction of entries in existent rows and columns as well as in new rows and columns. We use the same datasets as for BCC: Jester, Movielens, and Foodmart. We use the original data matrices for Jester and Foodmart; for Movielens, following [4], we convert each rating x_ij to 6 − x_ij so that the distribution is closer to Gaussian.

Figure 6.3: MSE on Jester compared to different algorithms with different choices of (k_1, k_2): (a) RBC with co-clustering algorithms, (b) RBC with other algorithms.

6.4.1 Missing Value Prediction for Existent Rows and Columns

We compare RBC and RBC-FF to other algorithms in terms of missing value prediction for existent rows and columns. We run the algorithms using 10-fold cross validation. The training data has at least one entry of the original data matrix in each row and column, so the prediction is performed on existent rows and columns.

RBC vs. Other Co-clustering Algorithms

We first compare RBC and RBC-FF to other co-clustering algorithms: Bregman co-clustering [10] with 6 schemes (C1-C6), Bayesian co-clustering [112], and co-clustering based on spectral graph partitioning [37]. For Bregman co-clustering with different schemes, missing value prediction is based on the summary statistics preserved by each scheme (BregC1-BregC6). Bayesian co-clustering uses x̂_ij = Σ_{g,h} φ_1ig φ_2jh μ_gh for prediction. For spectral graph partitioning, we use two prediction strategies following C2 and C5 (SpecC2 and SpecC5). In particular, given row cluster g for row i and column cluster h for column j, C2 predicts x_ij as x̂_ij = μ_gh, and C5 predicts x_ij as x̂_ij = μ_gh + m_1i + m_2j − μ_1g − μ_2h, where μ_1g and μ_2h are the means for row cluster g and column cluster h respectively. We compare these algorithms using different numbers of row clusters k_1 and column clusters k_2. Since spectral graph partitioning keeps the same number of row and column clusters, we set its cluster number to max(k_1, k_2). The average mean square error (MSE) over 10-fold cross validation on Jester is shown in Figure 6.3(a). To avoid clutter on the plots, we do not show the results

corresponding to SpecC2 and BregC6, which have very poor performance. Tables 6.2 and 6.3 show the results on Movielens and Foodmart for all eleven algorithms. As a baseline, we also report the average MSE on Jester, Movielens, and Foodmart obtained by predicting with the mean of the training data. The observations are as follows:

1. Overall, RBC and RBC-FF achieve lower MSE than the other co-clustering algorithms in almost all cases, across the different datasets and numbers of clusters. The only exception is on Jester with (k_1, k_2) = (10, 5), where BregC5 has the lowest MSE. These observations indicate that RBC can generate more accurate predictions than other co-clustering algorithms.

2. The fact that RBC and RBC-FF outperform BCC shows the advantage of doing co-clustering on the residual after removing the row and column biases.

3. The fact that RBC and RBC-FF outperform Bregman co-clustering is probably because they are mixed-membership models. Allowing each row (column) to belong to multiple row (column) clusters with varying degrees may yield a more reasonable co-clustering, which in turn helps missing value prediction. Further evidence for this viewpoint is that BCC, which is the mixed-membership version of BregC2, achieves lower MSE than BregC2.

4. The performance of RBC and RBC-FF improves as k_1 and k_2 increase, because more parameters are used for the approximation with a larger k_1 or k_2.

5. The results from RBC and RBC-FF are very close to each other, indicating that parallel RBC, which is built on RBC-FF, does not sacrifice accuracy.

RBC vs. Other Prediction Algorithms

We then compare RBC and RBC-FF to other missing value prediction algorithms, namely singular value decomposition (SVD) [55], non-negative matrix factorization (NNMF) [77], and correlation-based algorithms (CORR) [110].

Table 6.2: MSE on Movielens compared to other co-clustering algorithms (SpecC2, SpecC5, BregC1-BregC6, BCC, RBC, RBC-FF) with different choices of (k_1, k_2).

Table 6.3: MSE on Foodmart compared to other co-clustering algorithms (SpecC2, SpecC5, BregC1-BregC6, BCC, RBC, RBC-FF) with different choices of (k_1, k_2).

SVD and NNMF predict the missing entries based on low-rank matrix approximation, while CORR finds similar rows as neighbors of the row with the missing entry and predicts the missing entry using a combination of the neighbors. For SVD and NNMF, the missing entries are filled with the means of the non-missing entries in the same row, and the rank is set to max(k_1, k_2). For CORR, the number of neighbors is fixed at 100. Since the ratings in Jester range from -10 to 10, we first add 11 to the training data to run NNMF and then subtract 11 from the predicted values to get the real predictions.

Figure 6.3(b) shows the average MSE over 10-fold cross validation on Jester, and Tables 6.4 and 6.5 show the results on Movielens and Foodmart. On Jester, when (k_1, k_2) is small, SVD has the lowest MSE; as (k_1, k_2) increases, the MSE of SVD increases, while the MSE of RBC and RBC-FF decreases and becomes the lowest. On Movielens, RBC and RBC-FF perform the best among all algorithms. On Foodmart, SVD is the best, but RBC and RBC-FF still achieve lower MSE than NNMF and CORR. Therefore, overall, RBC and RBC-FF have better performance than NNMF and CORR and competitive performance with SVD, depending on the dataset and parameters, and as we will show later, parallel RBC scales substantially more efficiently than SVD.
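All the MSE numbers reported in this section follow the same recipe: split the observed entries into folds, train on the remaining entries, and score the held-out ones. A minimal sketch of that evaluation loop, where `train_and_predict` is a stand-in for any of the compared methods and the example baseline is the training-mean predictor mentioned above (names and toy data are illustrative):

```python
import numpy as np

def cv_mse(X, delta, train_and_predict, n_folds=10, seed=0):
    """n-fold cross-validated MSE over the non-missing entries of X."""
    rows, cols = np.nonzero(delta)
    idx = np.random.default_rng(seed).permutation(rows.size)
    errors = []
    for fold in np.array_split(idx, n_folds):
        test_mask = np.zeros_like(delta)
        test_mask[rows[fold], cols[fold]] = True
        train_mask = delta & ~test_mask
        X_hat = train_and_predict(X, train_mask)   # full N x M prediction
        errors.append(np.mean((X_hat - X)[test_mask] ** 2))
    return float(np.mean(errors))

# Baseline: predict every missing entry with the global mean of the training data.
baseline = lambda X, m: np.full_like(X, X[m].mean())
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20)); delta = rng.random((30, 20)) < 0.5
print(cv_mse(X, delta, baseline))
```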

Table 6.4: MSE on Movielens compared to SVD, NNMF and CORR (together with RBC and RBC-FF) with different choices of (k_1, k_2).

Table 6.5: MSE on Foodmart compared to SVD, NNMF, and CORR (together with RBC and RBC-FF) with different choices of (k_1, k_2).

6.4.2 Missing Value Prediction for New Rows and Columns

We now show the results for RBC and RBC-FF on missing value prediction for new rows and columns on Jester. SVD and NNMF cannot handle such cases, and neither can Bregman co-clustering or spectral graph partitioning. Therefore, we only compare our algorithm to BCC. Each time we randomly hold out 5 rows (columns) as the test set and train the model using the rest of the data. After that, given three random entries in each row (column) of the test set, we try to predict the rest of the entries. We repeat the process 10 times. The average MSE for RBC, RBC-FF, and BCC is presented in Figure 6.4. As a baseline, we also compute the average MSE for new rows and for new columns. The observations are as follows:

1. Overall, RBC and RBC-FF have lower MSE than BCC, indicating that, in addition to their better performance on existent rows and columns, RBC and RBC-FF also generate more accurate predictions than BCC on new rows and columns.

2. Unlike the results in Figure 6.3(a), where RBC and RBC-FF have similar performance, in predicting the entries for new rows and columns RBC-FF

performs better than RBC.

Figure 6.4: MSE of RBC and BCC for missing value prediction for new users and jokes on Jester: (a) new user, (b) new joke.

3. Another difference is that in Figure 6.3(a) the MSE of the three algorithms decreases as (k_1, k_2) increases, while in Figure 6.4 the overall trend of the curves is upward, although not very pronounced, especially for BCC. A possible reason is that as (k_1, k_2) increases, it becomes more and more difficult to figure out the correct mixed membership over all row (column) clusters for the new rows (columns) when only a few entries are given.

4. The prediction on new jokes is better than that on new users, which indicates that it is easier to predict the ratings each joke will receive from different users than to predict the ratings each user will give to different jokes. This observation is surprising, since for each new joke we need to predict 997 ratings given 3, while for each new user we only need to predict 97 ratings given 3. Intuitively, it might be because the ratings each joke gets from different users are more consistent than the ratings each user gives to different jokes.

6.4.3 Running Time of Parallel RBC

We show the running time comparison between parallel RBC and SVD. We randomly generate nine matrices with scales increasing in steps of 500. To simulate parallel RBC using one processor, we run RBC-FF and take the running

time that would be spent on the client side and on the server side, denoted T_c and T_s respectively, so the total time for parallel RBC is approximately T_c/d + T_s, where d is the number of clients. Assuming there are ten processors, the running times of RBC and SVD are shown in Figure 6.5.

Figure 6.5: Running time of parallel RBC and SVD on matrices of different scales.

When the data matrix is small, both RBC and SVD run very fast, and SVD is slightly faster. When the scale of the matrix increases, the running time of SVD increases rapidly, as shown by the steep curve. In comparison, the curve for parallel RBC only goes up slowly, and its advantage in terms of computational efficiency becomes more and more distinct. In this experiment we ignored the communication overhead, so the numbers shown in the figure might not be very accurate; however, the trend of the curves is clear. Moreover, in our experiment we assume that there are only ten processors. In large-scale systems with hundreds of processors, parallel RBC could be orders of magnitude faster.

6.5 Conclusion

In this chapter, we have introduced residual Bayesian co-clustering for matrix approximation. It extends Bayesian co-clustering by taking row and column biases into consideration, hence capturing a more reasonable generative process for the matrix. Two variational inference algorithms are proposed. One shares a variational distribution among all entries in the same row/column to keep the dependency between rows and columns. The other

uses a fully factorized variational distribution, which removes this dependency and leads to parallel RBC. In the missing value prediction experiments, RBC generates more accurate predictions than several other co-clustering algorithms and results competitive with matrix factorization based algorithms such as SVD and NNMF. Moreover, SVD and most other algorithms cannot do prediction for new rows and columns, while RBC handles such problems naturally. In addition, parallel RBC is much more efficient than SVD, making it applicable to the very large matrices found in real data.

Chapter 7

Parametric Probabilistic Matrix Factorization

In previous chapters, we have discussed several clustering algorithms. Starting from this chapter, we discuss matrix factorization algorithms, in particular matrix factorizations in a probabilistic framework for missing value prediction. All of these algorithms can be considered extensions of probabilistic matrix factorization (PMF) [107].

In recent years, matrix factorization methods have been successfully applied to collaborative filtering [72]. For example, in movie recommendation, given a rating matrix, the idea is to predict any missing entry (i, j) with the inner product of latent feature vectors for row (user) i and column (movie) j. The idea was explored by Simon Funk [52], and later a probabilistic framework was developed, yielding probabilistic matrix factorization (PMF) [107] and Bayesian PMF (BPMF) [108]. Both have achieved high accuracy in collaborative filtering. In this chapter, we propose parametric PMF (PPMF) based on the following two questions:

First, are the prior distributions used in PMF and BPMF suitable, or is it possible to get better predictions and a simpler algorithm with different priors? PMF assumes a diagonal covariance for the Gaussian prior, implying independent latent features; BPMF maintains a distribution over all possible covariance matrices. A model between PMF and BPMF is parametric PMF (PPMF), which allows a non-diagonal

covariance matrix but does not maintain a distribution over all covariance matrices. We discuss the PPMF model in this chapter. The motivation is to avoid the independence assumption in PMF and to avoid the full Bayesian treatment in BPMF, thereby simplifying the learning process.

Second, is there any benefit in taking row and column biases into account in the PMF framework? In residual Bayesian co-clustering, we have shown that incorporating the row and column biases helps to improve the algorithm's performance. In matrix factorization settings, when there are row and column biases, the inner product of latent feature vectors might not be a good explanation for the full rating, but only for the residual rating after the biases are removed. Given that considering biases in SVD [100] improves prediction performance, we propose residual PPMF, which takes the biases into account in the PPMF framework. Through experiments on movie recommendation datasets, we show that PPMF performs better than PMF, BPMF, and co-clustering based algorithms, and that incorporating the residual improves PPMF's performance even further.

7.1 Preliminaries

Consider an N × M real-valued matrix X with a number of missing entries. The goal of matrix completion is to predict the values of those missing entries. Probabilistic Matrix Factorization (PMF) [107] approaches this problem from the matrix factorization perspective. Assuming each row i has a latent vector u_i ∈ R^D and each column j has a latent vector v_j ∈ R^D, the generative process for PMF is as follows (see Figure 7.1):

1. For each row i in X, i = 1, ..., N, generate u_i ~ N(0, σ_1² I), where I ∈ R^{D×D} denotes the identity matrix.
2. For each column j in X, j = 1, ..., M, generate v_j ~ N(0, σ_2² I).
3. For each of the S non-missing entries (i, j), generate x_ij ~ N(u_i^T v_j, σ²).

The model has zero-mean spherical Gaussian priors on u_i and v_j, and each entry x_ij is generated from a univariate Gaussian with mean determined by the inner product of u_i and v_j. Letting U = [u_1, u_2, ..., u_N] and V = [v_1, v_2, ..., v_M], the log-posterior

over the latent matrices U ∈ R^{D×N} and V ∈ R^{D×M} is given by:

\log p(U, V | X, \sigma^2, \sigma_1^2, \sigma_2^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} (x_{ij} - u_i^T v_j)^2 - \frac{1}{2\sigma_1^2} \sum_{i=1}^{N} u_i^T u_i - \frac{1}{2\sigma_2^2} \sum_{j=1}^{M} v_j^T v_j - \frac{1}{2}\big( S \log\sigma^2 + ND \log\sigma_1^2 + MD \log\sigma_2^2 \big) + C,    (7.1)

Figure 7.1: The graphical model for PMF.

where δ_ij is the indicator taking value 1 if x_ij is an observed entry and 0 otherwise, S is the number of non-missing entries in X, and C is a constant that does not depend on the latent matrices U and V. MAP inference maximizes the log-posterior with respect to U and V, which can then be used to predict the missing entries in X. As an extension of PMF, Bayesian PMF (BPMF) [108] introduces a full Bayesian prior for each u_i and each v_j: u_i (and similarly v_j) is sampled from N(μ_1, Σ_1), where the hyperparameters {μ_1, Σ_1} are further sampled from Gaussian-Wishart priors.
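MAP estimation of U and V amounts to maximizing (7.1). The sketch below takes plain gradient-ascent steps on that objective with the variances fixed; it is one simple way to realize the optimization, not necessarily the optimizer used in [107], and all names and toy data are illustrative.

```python
import numpy as np

def pmf_map_gradient_step(X, delta, U, V, sigma2, sigma1_2, sigma2_2, lr=0.01):
    """One gradient-ascent step on the log-posterior (7.1) w.r.t. U (D x N), V (D x M)."""
    R = delta * (X - U.T @ V)                 # residuals on observed entries only
    grad_U = (V @ R.T) / sigma2 - U / sigma1_2
    grad_V = (U @ R) / sigma2 - V / sigma2_2
    return U + lr * grad_U, V + lr * grad_V

# Toy usage: alternate gradient steps on a small synthetic matrix.
rng = np.random.default_rng(0)
N, M, D = 20, 15, 4
X = rng.normal(size=(N, M)); delta = (rng.random((N, M)) < 0.4).astype(float)
U, V = 0.1 * rng.normal(size=(D, N)), 0.1 * rng.normal(size=(D, M))
for _ in range(200):
    U, V = pmf_map_gradient_step(X, delta, U, V, sigma2=1.0, sigma1_2=1.0, sigma2_2=1.0)
print(np.sum(delta * (X - U.T @ V) ** 2))     # fitted squared error on observed entries
```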

7.2 Parametric PMF

In this section, we propose parametric PMF. For ease of exposition, we assume that we are working on a movie rating matrix, where the rows represent the users and the columns represent the movies. Given a matrix X ∈ R^{N×M}, and assuming the D-dimensional latent feature vector for each user i is u_i and for each movie j is v_j, the generative process of the matrix X following PPMF is as follows (Figure 7.2):

1. For each user i, i = 1, ..., N, generate u_i ~ N(μ_1, Σ_1).
2. For each movie j, j = 1, ..., M, generate v_j ~ N(μ_2, Σ_2).
3. For each of the S non-missing entries in X, generate x_ij ~ N(u_i^T v_j, σ²).

Figure 7.2: The graphical model for PPMF.

The likelihood of X is given by

p(X | P) = \int_{u_1 \ldots u_N} \int_{v_1 \ldots v_M} \prod_{i=1}^{N} p(u_i | \mu_1, \Sigma_1) \prod_{j=1}^{M} p(v_j | \mu_2, \Sigma_2) \prod_{i=1}^{N} \prod_{j=1}^{M} p(x_{ij} | u_i^T v_j, \sigma^2)^{\delta_{ij}} \, du_1 \ldots du_N \, dv_1 \ldots dv_M,    (7.2)

where P = {μ_1, Σ_1, μ_2, Σ_2, σ²} are the model parameters and δ_ij is 1 if x_ij is non-missing and 0 otherwise. Given X, since the likelihood p(X | P) is intractable, we use variational expectation maximization (EM) for learning and inference. Letting U = [u_1, u_2, ..., u_N] and V = [v_1, v_2, ..., v_M], we introduce a tractable family of distributions q(U, V | P') as an approximation of the true posterior p(U, V | X, P), where P' denotes the variational parameters. In particular, we introduce a variational Gaussian distribution N(λ_1i, diag(ν²_1i)) to generate u_i (λ_1i ∈ R^D and ν²_1i ∈ R^D), and N(λ_2j, diag(ν²_2j)) to generate v_j (λ_2j ∈ R^D and ν²_2j ∈ R^D), where diag(·) denotes a square diagonal matrix

with its argument on the diagonal, and ν²_1i ∈ R^D has d-th component ν²_1id (similarly for ν²_2j). Therefore, the variational distribution q is given by

q(U, V | P') = \prod_{i=1}^{N} q(u_i | \lambda_{1i}, \mathrm{diag}(\nu^2_{1i})) \prod_{j=1}^{M} q(v_j | \lambda_{2j}, \mathrm{diag}(\nu^2_{2j})),

where P' = {λ_1i, ν²_1i, λ_2j, ν²_2j, i = 1..N, j = 1..M}, and {λ_1i, ν²_1i} and {λ_2j, ν²_2j} are the parameters of the variational Gaussian distributions. Given q(U, V | P'), applying Jensen's inequality [25] yields a lower bound on the log-likelihood:

\log p(X | P) \geq E_q[\log p(U, V, X | P)] + H(q(U, V | P'))
= \sum_{i=1}^{N} E_q[\log p(u_i | \mu_1, \Sigma_1)] + \sum_{j=1}^{M} E_q[\log p(v_j | \mu_2, \Sigma_2)] + \sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} E_q[\log p(x_{ij} | u_i^T v_j, \sigma^2)] - \sum_{i=1}^{N} E_q[\log q(u_i | \lambda_{1i}, \mathrm{diag}(\nu^2_{1i}))] - \sum_{j=1}^{M} E_q[\log q(v_j | \lambda_{2j}, \mathrm{diag}(\nu^2_{2j}))].    (7.3)

Each term in (7.3) can be expanded as follows:

\sum_{i=1}^{N} E_q[\log p(u_i | \mu_1, \Sigma_1)] = -\frac{1}{2} \sum_{i=1}^{N} \big\{ \mathrm{Tr}(\Sigma_1^{-1} \mathrm{diag}(\nu^2_{1i})) + (\lambda_{1i} - \mu_1)^T \Sigma_1^{-1} (\lambda_{1i} - \mu_1) \big\} - \frac{ND}{2}\log 2\pi - \frac{N}{2}\log|\Sigma_1|    (7.4)

\sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} E_q[\log p(x_{ij} | u_i^T v_j, \sigma^2)] = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} \big( x_{ij}^2 - 2 x_{ij} \lambda_{1i}^T \lambda_{2j} + E_q[(u_i^T v_j)^2] \big) - \frac{S}{2}\log 2\pi\sigma^2    (7.5)

\sum_{i=1}^{N} E_q[\log q(u_i | \lambda_{1i}, \mathrm{diag}(\nu^2_{1i}))] = -\frac{DN}{2} - \frac{DN}{2}\log 2\pi - \frac{1}{2} \sum_{i=1}^{N} \log|\mathrm{diag}(\nu^2_{1i})|    (7.6)

The terms Σ_{j=1}^{M} E_q[log p(v_j | μ_2, Σ_2)] and Σ_{j=1}^{M} E_q[log q(v_j | λ_2j, diag(ν²_2j))] have forms similar to (7.4) and (7.6) respectively. Given the lower-bound function in (7.3), taking derivatives with respect to the variational parameters and the model parameters yields the update equations for the variational E-step and

M-step respectively. The variational EM algorithm then iterates through the E-step and M-step as follows.

E-step: Denoting the lower bound (7.3) by L(P, P'), the best lower bound can be found by maximizing L(P, P') over P', which gives

\lambda_{1i} = \Big( \Sigma_1^{-1} + \frac{1}{\sigma^2} \sum_{j=1}^{M} \delta_{ij} (\lambda_{2j} \lambda_{2j}^T + \mathrm{diag}(\nu^2_{2j})) \Big)^{-1} \Big( \Sigma_1^{-1} \mu_1 + \frac{1}{\sigma^2} \sum_{j=1}^{M} \delta_{ij} x_{ij} \lambda_{2j} \Big)    (7.7)

\nu^2_{1id} = \Big( \sum_{j=1}^{M} \delta_{ij} (\lambda^2_{2jd} + \nu^2_{2jd})/\sigma^2 + \Sigma^{-1}_{1,dd} \Big)^{-1},    i = 1..N, d = 1..D,    (7.8)

where Σ^{-1}_{1,dd} is the (d, d) entry of Σ_1^{-1}. λ_2j and ν²_2jd have similar forms. Note that although the covariance matrices of the variational Gaussians are diagonal, the model parameters Σ_1 and Σ_2 are not diagonal, so PPMF is able to capture the correlation among latent factors.

M-step: P' from the E-step gives us a surrogate objective function L(P, P'); optimizing L(P, P') over P yields the estimates of the model parameters:

\mu_1 = \frac{1}{N} \sum_{i=1}^{N} \lambda_{1i}    (7.9)

\Sigma_1 = \frac{1}{N} \sum_{i=1}^{N} \big( \mathrm{diag}(\nu^2_{1i}) + (\lambda_{1i} - \mu_1)(\lambda_{1i} - \mu_1)^T \big)    (7.10)

\sigma^2 = \frac{1}{S} \sum_{i=1}^{N} \sum_{j=1}^{M} \delta_{ij} \big( x_{ij}^2 + \lambda_{1i}^T \mathrm{diag}(\nu^2_{2j}) \lambda_{1i} + \lambda_{2j}^T \mathrm{diag}(\nu^2_{1i}) \lambda_{2j} - 2 x_{ij} \lambda_{1i}^T \lambda_{2j} + (\lambda_{1i}^T \lambda_{2j})^2 + \mathrm{Tr}(\mathrm{diag}(\nu^2_{1i})\mathrm{diag}(\nu^2_{2j})) \big)    (7.11)

where S is the total number of non-missing entries in X and Tr(·) is the trace of a matrix. The expressions for μ_2 and Σ_2 are similar to those for μ_1 and Σ_1.

To learn the model, the algorithm iterates through the E-step and M-step until convergence. In the E-step, the algorithm updates λ_1, λ_2 and ν_1, ν_2 alternately until convergence. The time complexity of each E-step is O(D²(DM + DN + MN) t_E), where D is the dimension of u and v, and t_E is the number of iterations inside the E-step. The time complexity of each M-step is O(D²N + D²M + DMN).
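For a single user i, the coordinate update (7.7)-(7.8) is a D × D linear solve plus an elementwise update. A sketch with the other variational parameters held fixed (all names and the toy call are illustrative):

```python
import numpy as np

def ppmf_update_user(i, X, delta, lam2, nu2_2, mu1, Sigma1, sigma2):
    """Update (lambda_1i, nu^2_1i) via (7.7)-(7.8) for a single row i.

    lam2: M x D column means, nu2_2: M x D column variances (diagonal covariances).
    """
    Sigma1_inv = np.linalg.inv(Sigma1)
    obs = np.nonzero(delta[i])[0]                       # movies rated by user i
    S = Sigma1_inv.copy()
    r = Sigma1_inv @ mu1
    for j in obs:
        S += (np.outer(lam2[j], lam2[j]) + np.diag(nu2_2[j])) / sigma2
        r += X[i, j] * lam2[j] / sigma2
    lam1_i = np.linalg.solve(S, r)                      # (7.7)
    nu2_1i = 1.0 / ((nu2_2[obs] + lam2[obs] ** 2).sum(0) / sigma2
                    + np.diag(Sigma1_inv))              # (7.8)
    return lam1_i, nu2_1i

# Toy call: N = 5 users, M = 4 movies, D = 3 latent dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4)); delta = (rng.random((5, 4)) < 0.6).astype(float)
print(ppmf_update_user(0, X, delta, rng.normal(size=(4, 3)), np.ones((4, 3)),
                       mu1=np.zeros(3), Sigma1=np.eye(3), sigma2=1.0))
```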

We compare PPMF to PMF and BPMF. PMF only uses zero-mean, diagonal-covariance Gaussian priors over u and v for regularization, and it uses a MAP estimate to infer the best U and V directly, with the Gaussian priors fixed upfront. In comparison, PPMF has a Gaussian prior with an arbitrary mean and a full covariance matrix; it learns the model parameters through maximum likelihood estimation by maximizing the log-likelihood of X, which integrates out all possible U and V, and it then infers the best U and V given the Gaussian parameters learned by the model. BPMF uses a full Bayesian treatment with hyperparameters on top of the Gaussian priors, which essentially keeps a distribution over all possible PPMF models. Therefore, PPMF lies between PMF and BPMF. Meanwhile, the variational inference for PPMF is a deterministic approximation algorithm, whereas the Markov chain Monte Carlo used in BPMF is a stochastic sampling based algorithm.

For prediction, we use a MAP estimate. In particular, for the estimate of the (i, j)-th entry, x̂_ij = û_i^T v̂_j, where

\{\hat u_i, \hat v_j\} = \arg\max_{(u_i, v_j)} p(u_i, v_j | X, P) \approx \arg\max_{(u_i, v_j)} q(u_i, v_j | P') = \{\lambda_{1i}, \lambda_{2j}\},

so we have x̂_ij = λ_1i^T λ_2j.

Figure 7.3: Graphical model for residual PPMF.

7.3 Residual PPMF

As we have discussed in Chapter 6, there are usually biases in the ratings. For example, a popular movie usually receives high ratings, and a critical user usually gives low ratings. Therefore, it may be unwise to explain the full rating x_ij using the inner product of u_i

and $v_j$. Instead, we can use user and movie biases to explain a certain part of $x_{ij}$, and use $u_i^T v_j$ to explain the residue after taking off the biases, which gives residual PPMF (rsPPMF). The main idea is as follows: instead of generating $x_{ij}$ from $\mathcal{N}(u_i^T v_j, \sigma^2)$, in the residual model $x_{ij}$ is generated from $\mathcal{N}(u_i^T v_j + f_i + g_j, \sigma^2)$, where $f_i$ and $g_j$ are the row and column biases, assumed to be generated from $\mathcal{N}(m_1, \sigma_1^2)$ and $\mathcal{N}(m_2, \sigma_2^2)$ respectively. Therefore, the matrix factorization is performed on $X$ after the effects of $f_i$ and $g_j$ have been removed. As in Figure 7.3, the generative process of rsPPMF for the matrix $X$ with $N$ users and $M$ movies is as follows:

1. For each user $i$, $[i]_1^N$, generate $u_i \sim \mathcal{N}(\mu_1, \Sigma_1)$.
2. For each movie $j$, $[j]_1^M$, generate $v_j \sim \mathcal{N}(\mu_2, \Sigma_2)$.
3. For each user $i$, $[i]_1^N$, generate $f_i \sim \mathcal{N}(m_1, \sigma_1^2)$.
4. For each movie $j$, $[j]_1^M$, generate $g_j \sim \mathcal{N}(m_2, \sigma_2^2)$.
5. For each non-missing entry $(i, j)$ in $X$, generate $x_{ij} \sim \mathcal{N}(u_i^T v_j + f_i + g_j, \sigma^2)$.

The likelihood of observing $X$ is therefore

$p(X \mid \mathcal{P}) = \int \left(\prod_{i=1}^{N} p(u_i \mid \mu_1, \Sigma_1)\, p(f_i \mid m_1, \sigma_1^2)\right) \left(\prod_{j=1}^{M} p(v_j \mid \mu_2, \Sigma_2)\, p(g_j \mid m_2, \sigma_2^2)\right) \prod_{i=1}^{N}\prod_{j=1}^{M} p(x_{ij} \mid u_i^T v_j + f_i + g_j, \sigma^2)^{\delta_{ij}}\, du_1 \ldots du_N\, dv_1 \ldots dv_M\, df_1 \ldots df_N\, dg_1 \ldots dg_M \quad (7.12)$

where $\mathcal{P} = \{\mu_1, \Sigma_1, \mu_2, \Sigma_2, \sigma^2, m_1, \sigma_1^2, m_2, \sigma_2^2\}$. For inference, based on the variational distribution used for PPMF, we introduce two new terms into the variational distribution: a variational Gaussian $\mathcal{N}(\theta_{1i}, \eta_{1i}^2)$ for the row bias $f_i$ and $\mathcal{N}(\theta_{2j}, \eta_{2j}^2)$ for the column bias $g_j$. The variational distribution hence becomes:

$q(U, V, f, g \mid P') = \left(\prod_{i=1}^{N} q(u_i \mid \lambda_{1i}, \mathrm{diag}(\nu_{1i}^2))\, q(f_i \mid \theta_{1i}, \eta_{1i}^2)\right) \left(\prod_{j=1}^{M} q(v_j \mid \lambda_{2j}, \mathrm{diag}(\nu_{2j}^2))\, q(g_j \mid \theta_{2j}, \eta_{2j}^2)\right), \quad (7.13)$

where $P' = \{\lambda_{1i}, \nu_{1i}^2, \lambda_{2j}, \nu_{2j}^2, \theta_{1i}, \eta_{1i}^2, \theta_{2j}, \eta_{2j}^2, [i]_1^N, [j]_1^M\}$. In particular, $(\lambda_{1i}, \mathrm{diag}(\nu_{1i}^2))$ and $(\lambda_{2j}, \mathrm{diag}(\nu_{2j}^2))$ are the parameters of multivariate variational Gaussian distributions, and $(\theta_{1i}, \eta_{1i}^2)$ and $(\theta_{2j}, \eta_{2j}^2)$ are the parameters of univariate Gaussian distributions. The update equations for $\nu_1$, $\nu_2$, $\mu_1$, $\mu_2$ and $\Sigma_1$, $\Sigma_2$ are the same as in PPMF. For the remaining parameters, we have:

E-step:

$\lambda_{1i} = \left(\Sigma_1^{-1} + \frac{1}{\sigma^2}\sum_{j=1}^{M} \delta_{ij}\left(\lambda_{2j}\lambda_{2j}^T + \mathrm{diag}(\nu_{2j}^2)\right)\right)^{-1} \left(\Sigma_1^{-1}\mu_1 + \frac{1}{\sigma^2}\sum_{j=1}^{M} \delta_{ij}(x_{ij} - \theta_{1i} - \theta_{2j})\lambda_{2j}\right) \quad (7.14)$

$\theta_{1i} = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}\sum_{j=1}^{M} \delta_{ij}\right)^{-1} \left(\frac{m_1}{\sigma_1^2} + \frac{1}{\sigma^2}\sum_{j=1}^{M} \delta_{ij}(x_{ij} - \theta_{2j} - \lambda_{1i}^T\lambda_{2j})\right) \quad (7.15)$

$\eta_{1i}^2 = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}\sum_{j=1}^{M} \delta_{ij}\right)^{-1}. \quad (7.16)$

The expressions for $\lambda_{2j}$, $\theta_{2j}$ and $\eta_{2j}^2$ are similar.

M-step:

$\sigma^2 = \frac{1}{S}\sum_{i=1}^{N}\sum_{j=1}^{M} \delta_{ij}\Big(x_{ij}^2 + \lambda_{1i}^T\mathrm{diag}(\nu_{2j}^2)\lambda_{1i} + \lambda_{2j}^T\mathrm{diag}(\nu_{1i}^2)\lambda_{2j} - 2x_{ij}(\lambda_{1i}^T\lambda_{2j} + \theta_{1i} + \theta_{2j}) + (\lambda_{1i}^T\lambda_{2j} + \theta_{1i} + \theta_{2j})^2 + \mathrm{Tr}(\mathrm{diag}(\nu_{1i}^2)\mathrm{diag}(\nu_{2j}^2)) + \eta_{1i}^2 + \eta_{2j}^2\Big) \quad (7.17)$

$m_1 = \frac{1}{N}\sum_{i=1}^{N} \theta_{1i} \quad (7.18)$

$\sigma_1^2 = \frac{1}{N}\sum_{i=1}^{N} \left((\theta_{1i} - m_1)^2 + \eta_{1i}^2\right). \quad (7.19)$

The expressions for $m_2$ and $\sigma_2^2$ are similar. For prediction with residual PPMF, following the MAP estimate, we have $\hat{x}_{ij} = \hat{u}_i^T\hat{v}_j + \hat{f}_i + \hat{g}_j$.
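To make the generative process of rsPPMF concrete, here is a minimal sketch that samples a full matrix from the model described by steps 1-5 above. The function and variable names are our own; in the actual model only the non-missing entries are scored, whereas this sketch simply generates every entry.

```python
import numpy as np

def sample_rsppmf(N, M, mu1, Sigma1, mu2, Sigma2, m1, s1_sq, m2, s2_sq, sigma_sq, seed=0):
    """Minimal sketch of the residual-PPMF generative process (steps 1-5)."""
    rng = np.random.default_rng(seed)
    U = rng.multivariate_normal(mu1, Sigma1, size=N)    # user latent vectors u_i
    V = rng.multivariate_normal(mu2, Sigma2, size=M)    # movie latent vectors v_j
    f = rng.normal(m1, np.sqrt(s1_sq), size=N)          # user (row) biases
    g = rng.normal(m2, np.sqrt(s2_sq), size=M)          # movie (column) biases
    mean = U @ V.T + f[:, None] + g[None, :]            # u_i^T v_j + f_i + g_j
    X = rng.normal(mean, np.sqrt(sigma_sq))             # x_ij ~ N(mean, sigma^2)
    return X, U, V, f, g
```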

7.4 Experimental Results

In this section, we compare PPMF with two types of algorithms in terms of missing value prediction on movie rating data: co-clustering algorithms, and matrix factorization algorithms, in particular PMF and BPMF.¹ For the data set, we use the Million MovieLens data set, which contains 1 million ratings for 3900 movies by 6040 users. For evaluation, we use mean square error (MSE) as the measure of prediction accuracy on the rating matrix. A small part of the ratings is held out as a validation set, which is used during training to decide when to stop the variational EM iterations. In particular, we stop the variational EM when the number of iterations is larger than 3 (because we do not want the iterations to stop too early) and the MSE on the validation set is larger than in the last iteration. For the rest of the ratings, we use 10-fold cross validation and report the average MSE over the 10 test folds. Before running the algorithms, we transform each rating $x_{ij}$ to $6 - x_{ij}$, so that the ratings are closer to a Gaussian distribution [4].

7.4.1 PPMF vs. Co-clustering Algorithms

We first compare PPMF to co-clustering based algorithms. (6.13) in Chapter 6 shows how to use co-clustering results for missing value prediction. In our experiments, we compare PPMF with three co-clustering algorithms: spectral co-clustering (speccoc) [37], Bregman co-clustering (BregCoc) [10], and Bayesian co-clustering (BCC) [112]. We use two schemes (s2 and s5) for speccoc and BregCoc, with different schemes keeping different types of statistics [10]. To initialize the co-clustering algorithms, we run k-means on the low-rank vectors from an imputed singular value decomposition (SVD) and use the resulting membership vectors for initialization. For PPMF, we use random initialization. The MSE with $D$ from 5 to 30 is presented in Table 7.1, where $D$ is the dimension of $u$ and $v$ for PPMF and the number of row/column clusters for the co-clustering algorithms.

¹ For the algorithms we compare with, the code for spectral co-clustering is from , and the code for the other algorithms is from the authors of the original papers.
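As a small illustration of the evaluation protocol just described, the sketch below (our own illustrative code, not from the thesis) applies the $6 - x_{ij}$ transform and the validation-based stopping rule. The names `em_step` and `predict_val` are hypothetical placeholders standing in for one variational EM iteration and for predicting the held-out validation ratings.

```python
import numpy as np

def transform_ratings(X):
    """Rating transform used before training: x_ij -> 6 - x_ij (assumes 1-5 ratings)."""
    return 6.0 - X

def run_with_early_stopping(em_step, predict_val, val_truth, max_iters=100):
    """Stop once more than 3 iterations have run and the validation MSE
    exceeds that of the previous iteration."""
    prev_mse = np.inf
    for t in range(1, max_iters + 1):
        em_step()
        mse = np.mean((predict_val() - val_truth) ** 2)
        if t > 3 and mse > prev_mse:
            break
        prev_mse = mse
```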

We can see that PPMF clearly generates a smaller MSE compared to the co-clustering based algorithms. Co-clustering algorithms are representative of neighborhood based algorithms, since similar rows/columns will have similar membership vectors. While they can learn the clustering structure of a matrix, their matrix approximation results are usually not as good as those of matrix factorization based algorithms.

Table 7.1: MSE from PPMF and co-clustering based algorithms (speccoc s2, speccoc s5, BregCoc s2, BregCoc s5, BCC, and PPMF, for D from 5 to 30).

7.4.2 PPMF vs. PMF and BPMF

We then compare PPMF with PMF and BPMF. For BPMF, we do not use the early stopping strategy since it hurts the performance. The MSE of the three algorithms using random initialization is presented in Table 7.2. We can see that PPMF performs better than PMF, and it is surprising to see that PPMF performs even better than BPMF, which gives us supportive evidence to go with PPMF instead of a full Bayesian model as in BPMF, though more rigorous experiments would be needed to fully compare the performance of these algorithms.

Table 7.2: MSE from PMF, BPMF and PPMF (for D from 5 to 30).

7.4.3 PPMF vs. Residual PPMF

The results of PPMF compared to residual PPMF for different choices of $D$ are presented in Table 7.3. We can see that the residual model always has higher accuracy than the original model, indicating that incorporating the row and column biases into the model helps to improve the prediction accuracy. Such results are consistent with what we have observed for Bayesian co-clustering and residual Bayesian co-clustering.

Table 7.3: MSE from PPMF and residual PPMF (for D from 5 to 30).

7.5 Conclusion

In this chapter, we have discussed parametric probabilistic matrix factorization. Instead of the zero-mean, diagonal-covariance Gaussian prior in PMF, PPMF uses a Gaussian prior with an arbitrary mean and a full covariance matrix. Meanwhile, it does not keep a distribution over all possible Gaussian priors as in BPMF, so it has a simpler formulation than BPMF. We have also discussed residual PPMF, where the factorization is done after removing the row and column biases from the data matrix. We use MLE to learn the parameters through a variational EM algorithm. Experimental results on movie rating data show that PPMF has higher accuracy in missing value prediction than co-clustering algorithms, and it also outperforms PMF and BPMF. In addition, residual PPMF generates even better results compared to PPMF.

Chapter 8

Probabilistic Matrix Factorization with Features

While probabilistic matrix factorization based algorithms have achieved great success for missing value prediction in collaborative filtering, given multi-way data with multiple sources of information available beyond a single data matrix, an interesting question is: are there suitable extensions of PMF models that leverage the side information of a matrix for better collaborative filtering? The side information could be of different types. First, feature vectors for the row and column objects of the data matrix, for example, genre or cast information for the movies in a movie rating matrix. Second, a graph among the row or column objects, for example, a social network among the users in a movie rating matrix. Third, a hierarchical structure among the row or column objects; for example, the users in a movie rating matrix could be grouped hierarchically based on their locations: city, state, country, etc.

Recent years have seen the emergence of work making use of different types of side information in matrix factorization. [127, 3, 101, 2] incorporated features of rows and columns into matrix factorization. In particular, [127] combined traditional collaborative filtering and probabilistic topic modeling for recommending scientific articles, [3] developed a matrix factorization method for recommender systems using LDA priors to regularize the model based on item meta-data and user features, [101] combined matrix

factorization with Dirichlet process mixtures for the Netflix Prize problem, and [2] incorporated side information into probabilistic matrix factorization using Gaussian processes for baseball outcome prediction. [57, 82] incorporated a graph into matrix factorization. In particular, [57] used a graph for regularization in matrix factorization, and [82] performed a joint factorization of the target matrix and the adjacency matrix of the graph to combine the information from both sources. There has not been much work on incorporating hierarchical structures into matrix factorization; [88] used the hierarchical structure of advertisements for missing value prediction in a click-through rate matrix.

In the following chapters, we discuss how to combine probabilistic matrix factorization based algorithms with each type of side information respectively. In particular, in this chapter, we discuss probabilistic matrix factorization with row or column features, and hence refer to the model as PMFF; the features we consider are in the form of discrete tokens. In the movie rating example, the side information could be the plots of the movies. The main idea is to apply a correlated topic model (CTM) to the side information and apply parametric probabilistic matrix factorization (PPMF) to the rating matrix. The coupling between the two models comes from shared latent variables, so the learning and inference process takes the coupling into consideration for better prediction. In particular, PMFF tries to keep the latent feature vectors of two movies (the factorization results on the rating matrix) close to each other if the movies have similar side information. In this chapter, we assume that we only have side information on movies, such as the movies' plots, but it is straightforward to extend the algorithm to work on a matrix with side information on both sides. Also, in general, the model works with different features in the form of discrete tokens, such as cast, genre, etc., but for ease of exposition, we will still use "document", "words", and "topics" when describing the model. Experimental results on movie rating data with features on movies show that incorporating features into PPMF improves the prediction accuracy.

Figure 8.1: Graphical model for PMFF.

8.1 PMFF

The main idea in PMFF is as follows: for each movie $j$, $v_j$ serves not only as PPMF's latent feature vector for its ratings, but also as CTM's membership vector over topics (after a logistic transformation) for its corresponding features. Therefore the shared $v_j$ for both the ratings and the side information of movie $j$ becomes the glue that combines PPMF and CTM. Given a matrix $X$ with $N$ users and $M$ movies, for each movie $j$ we have a feature vector $f_j$ as a collection of words. Denoting the side information of the $M$ movies as $F = \{f_j, [j]_1^M\}$ (where $[j]_1^M$ denotes $j = 1, \ldots, M$), the generative process of $(X, F)$ for PMFF is given as follows (Figure 8.1):

1. For each user $i$ in $X$, $[i]_1^N$, generate $u_i \sim \mathcal{N}(\mu_1, \Sigma_1)$.
2. For each movie $j$ in $X$, $[j]_1^M$, generate $v_j \sim \mathcal{N}(\mu_2, \Sigma_2)$.
3. For the $l$-th word in $f_j$, $[j]_1^M$:
   (a) Generate a topic $z_{jl} \sim \mathrm{discrete}(\mathrm{logistic}(v_j))$.
   (b) Generate the word $f_{jl} \sim p(f_{jl} \mid \beta, z_{jl})$.
4. For each non-missing entry $(i, j)$ in $X$, generate $x_{ij} \sim \mathcal{N}(u_i^T v_j, \sigma^2)$.

Here $\mathrm{logistic}(v_j) = \exp(v_j)/\sum_{d=1}^{D}\exp(v_{jd})$ is a logistic function that transforms $v_j$ into a discrete distribution, and $\beta = \{\beta_d, [d]_1^D\}$ are $D$ discrete distributions for the $D$ topics over all words in the dictionary.
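The logistic (softmax) transform in step 3 is the only piece that differs from PPMF's generative process, so here is a minimal sketch of it, together with word sampling as in steps 3(a)-3(b). This is illustrative code of our own; `beta` is assumed to be a $D \times B$ matrix of topic-word probabilities and words are represented by dictionary indices.

```python
import numpy as np

def logistic_topic_dist(v_j):
    """Softmax ('logistic') transform of a movie's latent vector v_j into a
    distribution over the D topics, as in step 3(a)."""
    e = np.exp(v_j - v_j.max())            # subtract max for numerical stability
    return e / e.sum()

def sample_words(v_j, beta, n_words, seed=0):
    """Illustrative sampling of step 3: draw a topic per word, then a word from
    that topic's distribution."""
    rng = np.random.default_rng(seed)
    theta = logistic_topic_dist(v_j)
    topics = rng.choice(len(theta), size=n_words, p=theta)            # z_jl
    return [rng.choice(beta.shape[1], p=beta[z]) for z in topics]     # f_jl as word indices
```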

Let $\mathcal{P} = \{\mu_1, \Sigma_1, \mu_2, \Sigma_2, \sigma^2, \beta\}$; the likelihood of observing $X$ and $F$ is

$p(X, F \mid \mathcal{P}) = \int \left(\prod_{i=1}^{N} p(u_i \mid \mu_1, \Sigma_1)\right) \left(\prod_{j=1}^{M} p(v_j \mid \mu_2, \Sigma_2)\sum_{z_j} p(z_j \mid v_j)\, p(f_j \mid z_j, \beta)\right) \prod_{i=1}^{N}\prod_{j=1}^{M} p(x_{ij} \mid u_i^T v_j, \sigma^2)^{\delta_{ij}}\, du_1 \ldots du_N\, dv_1 \ldots dv_M,$

where $\delta_{ij}$ is the indicator taking value 1 when $x_{ij}$ is non-missing and 0 otherwise. We still use variational inference. Inference and learning for PMFF are similar to those for PPMF, except that we need to introduce $\mathrm{discrete}(\phi_j)$ into the variational distribution to generate $z_j$, so we have the variational distribution

$q(U, V, Z \mid P') = \prod_{i=1}^{N} q(u_i \mid \lambda_{1i}, \mathrm{diag}(\nu_{1i}^2)) \prod_{j=1}^{M}\left[ q(v_j \mid \lambda_{2j}, \mathrm{diag}(\nu_{2j}^2)) \prod_{l=1}^{G_j} q(z_{jl} \mid \phi_j)\right], \quad (8.1)$

where $U = [u_1, u_2, \ldots, u_N]$, $V = [v_1, v_2, \ldots, v_M]$, $G_j$ is the total number of words in $f_j$, and $P' = \{\lambda_{1i}, \nu_{1i}^2, \lambda_{2j}, \nu_{2j}^2, \phi_j, [i]_1^N, [j]_1^M\}$. In particular, $\{\lambda_{1i}, \mathrm{diag}(\nu_{1i}^2)\}$ and $\{\lambda_{2j}, \mathrm{diag}(\nu_{2j}^2)\}$ are variational parameters for Gaussian distributions, and $\phi_j$ is the parameter of a discrete distribution. We use a single $\phi_j$ to generate all $\{z_{jl}, [l]_1^{G_j}\}$ for movie $j$, which follows the fast variational inference idea in Chapter 2. After introducing the variational distribution $q$, applying Jensen's inequality gives the lower bound on the log-likelihood:

$\log p(F, X \mid \mu_1, \Sigma_1, \mu_2, \Sigma_2, \beta, \sigma^2) \geq E_q[\log p(U, V, Z, F, X \mid \mathcal{P})] - E_q[\log q(U, V, Z \mid P')]$
$= \sum_{i=1}^{N} E_q[\log p(u_i \mid \mu_1, \Sigma_1)] + \sum_{j=1}^{M} E_q[\log p(v_j \mid \mu_2, \Sigma_2)] + \sum_{i=1}^{N}\sum_{j=1}^{M} \delta_{ij} E_q[\log p(x_{ij} \mid u_i, v_j, \sigma^2)]$
$\quad + \sum_{j=1}^{M} E_q[\log p(z_j \mid v_j)] + \sum_{j=1}^{M} E_q[\log p(f_j \mid z_j, \beta)]$
$\quad - \sum_{i=1}^{N} E_q[\log q(u_i \mid \lambda_{1i}, \nu_{1i}^2)] - \sum_{j=1}^{M} E_q[\log q(v_j \mid \lambda_{2j}, \nu_{2j}^2)] - \sum_{j=1}^{M} E_q[\log q(z_j \mid \phi_j)] \quad (8.2)$

Each term in (8.2) is given as follows:

$\sum_{i=1}^{N} E_q[\log p(u_i \mid \mu_1, \Sigma_1)] = -\frac{1}{2}\sum_{i=1}^{N}\left\{\mathrm{Tr}(\Sigma_1^{-1}\mathrm{diag}(\nu_{1i}^2)) + (\lambda_{1i}-\mu_1)^T\Sigma_1^{-1}(\lambda_{1i}-\mu_1)\right\} - \frac{ND}{2}\log 2\pi + \frac{N}{2}\log|\Sigma_1^{-1}| \quad (8.3)$

$\sum_{j=1}^{M} E_q[\log p(v_j \mid \mu_2, \Sigma_2)] = -\frac{1}{2}\sum_{j=1}^{M}\left\{\mathrm{Tr}(\Sigma_2^{-1}\mathrm{diag}(\nu_{2j}^2)) + (\lambda_{2j}-\mu_2)^T\Sigma_2^{-1}(\lambda_{2j}-\mu_2)\right\} - \frac{MD}{2}\log 2\pi + \frac{M}{2}\log|\Sigma_2^{-1}| \quad (8.4)$

$\sum_{i=1}^{N}\sum_{j=1}^{M} \delta_{ij} E_q[\log p(x_{ij} \mid u_i^T v_j, \sigma^2)] = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}\big\{x_{ij}^2 - 2x_{ij}\lambda_{1i}^T\lambda_{2j} + (\lambda_{1i}^T\lambda_{2j})^2 + \lambda_{1i}^T\mathrm{diag}(\nu_{2j}^2)\lambda_{1i} + \lambda_{2j}^T\mathrm{diag}(\nu_{1i}^2)\lambda_{2j} + \mathrm{Tr}(\mathrm{diag}(\nu_{1i}^2)\mathrm{diag}(\nu_{2j}^2))\big\} - \frac{S}{2}\log 2\pi\sigma^2 \quad (8.5)$

$\sum_{j=1}^{M} E_q[\log p(f_j \mid z_j, \beta)] = \sum_{j=1}^{M}\sum_{d=1}^{D}\phi_{jd}\sum_{l=1}^{G_j}\sum_{b=1}^{B} 1(f_{jl} = b)\log\beta_{db} \quad (8.6)$

$\sum_{i=1}^{N} E_q[\log q(u_i \mid \lambda_{1i}, \nu_{1i}^2)] = -\frac{1}{2}\sum_{i=1}^{N}\sum_{d=1}^{D}(\log 2\pi + \log\nu_{1id}^2 + 1) \quad (8.7)$

$\sum_{j=1}^{M} E_q[\log q(v_j \mid \lambda_{2j}, \nu_{2j}^2)] = -\frac{1}{2}\sum_{j=1}^{M}\sum_{d=1}^{D}(\log 2\pi + \log\nu_{2jd}^2 + 1) \quad (8.8)$

$\sum_{j=1}^{M} E_q[\log q(z_j \mid \phi_j)] = \sum_{j=1}^{M} G_j\sum_{d=1}^{D}\phi_{jd}\log\phi_{jd}, \quad (8.9)$

where $1(f_{jl} = b)$ in (8.6) is an indicator taking value 1 if $f_{jl}$ is the $b$-th word of the $B$ words in the dictionary and 0 otherwise. For $\sum_{j=1}^{M} E_q[\log p(z_j \mid v_j)]$, we have

$\sum_{j=1}^{M}\sum_{l=1}^{G_j} E_q[\log p(z_{jl} \mid v_j)] = \sum_{j=1}^{M} G_j\left(\sum_{d=1}^{D}\lambda_{2jd}\phi_{jd} - E_q\left[\log\sum_{d=1}^{D}\exp(v_{jd})\right]\right).$

Expanding $\log(\sum_{d=1}^{D}\exp(v_{jd}))$ at $\xi_j$ using a second-order Taylor expansion yields

$E_q\left[\log\sum_{d=1}^{D}\exp(v_{jd})\right] \approx \log\left(\sum_{d=1}^{D}\exp(\xi_{jd})\right) + (\lambda_{2j}-\xi_j)^T\frac{\exp(\xi_j)}{\sum_{d=1}^{D}\exp(\xi_{jd})} + \frac{1}{2}\mathrm{Tr}(H(\xi_j)\mathrm{diag}(\nu_{2j}^2)) + \frac{1}{2}(\lambda_{2j}-\xi_j)^T H(\xi_j)(\lambda_{2j}-\xi_j),$

where $H(\xi_j)$ is the Hessian matrix given by

$H(\xi_j) = \frac{\mathrm{diag}(\exp(\xi_j))}{\sum_{d=1}^{D}\exp(\xi_{jd})} - \frac{\exp(\xi_j)\exp(\xi_j)^T}{\left(\sum_{d=1}^{D}\exp(\xi_{jd})\right)^2}.$

Therefore, the term $\sum_{j=1}^{M} E_q[\log p(z_j \mid v_j)]$ in (8.2) is given by

$\sum_{j=1}^{M} E_q[\log p(z_j \mid v_j)] = \sum_{j=1}^{M} G_j\sum_{d=1}^{D}\lambda_{2jd}\phi_{jd} - \sum_{j=1}^{M} G_j\left\{\log\left(\sum_{d=1}^{D}\exp(\xi_{jd})\right) + (\lambda_{2j}-\xi_j)^T\frac{\exp(\xi_j)}{\sum_{d=1}^{D}\exp(\xi_{jd})} + \frac{1}{2}\mathrm{Tr}(H(\xi_j)\mathrm{diag}(\nu_{2j}^2)) + \frac{1}{2}(\lambda_{2j}-\xi_j)^T H(\xi_j)(\lambda_{2j}-\xi_j)\right\} \quad (8.10)$

Maximizing the lower-bound function in (8.2) w.r.t. the variational and model parameters yields the update equations. In the E-step, the variational parameters are updated as follows:

$\lambda_{1i} = \left(\Sigma_1^{-1} + \frac{1}{\sigma^2}\sum_{j=1}^{M}\delta_{ij}\left(\lambda_{2j}\lambda_{2j}^T + \mathrm{diag}(\nu_{2j}^2)\right)\right)^{-1}\left(\Sigma_1^{-1}\mu_1 + \frac{1}{\sigma^2}\sum_{j=1}^{M}\delta_{ij}x_{ij}\lambda_{2j}\right) \quad (8.11)$

$\nu_{1id}^2 = \left(\sum_{j=1}^{M}\delta_{ij}(\lambda_{2jd}^2 + \nu_{2jd}^2)/\sigma^2 + \Sigma^{-1}_{1,dd}\right)^{-1} \quad (8.12)$

$\lambda_{2j} = \left(\Sigma_2^{-1} + \frac{1}{\sigma^2}\sum_{i=1}^{N}\delta_{ij}\left(\lambda_{1i}\lambda_{1i}^T + \mathrm{diag}(\nu_{1i}^2)\right) + G_j H(\xi_j)\right)^{-1}\left(\Sigma_2^{-1}\mu_2 + \frac{1}{\sigma^2}\sum_{i=1}^{N}\delta_{ij}x_{ij}\lambda_{1i} - G_j\frac{\exp(\xi_j)}{\sum_{d=1}^{D}\exp(\xi_{jd})} + G_j H(\xi_j)\xi_j + G_j\phi_j\right) \quad (8.13)$

$\nu_{2jd}^2 = \left(G_j\left(\frac{\exp(\xi_{jd})}{\sum_{d'=1}^{D}\exp(\xi_{jd'})} - \left(\frac{\exp(\xi_{jd})}{\sum_{d'=1}^{D}\exp(\xi_{jd'})}\right)^2\right) + \Sigma^{-1}_{2,dd} + \frac{1}{\sigma^2}\sum_{i=1}^{N}\delta_{ij}(\lambda_{1id}^2 + \nu_{1id}^2)\right)^{-1} \quad (8.14)$

$\phi_{jd} \propto \exp\left(\lambda_{2jd} + \frac{1}{G_j}\sum_{l=1}^{G_j}\sum_{b=1}^{B} 1(f_{jl} = b)\log\beta_{db}\right) \quad (8.15)$

In each iteration of EM, $\xi_j$ takes the value of $\lambda_{2j}$ from the last iteration.

For the update equations in the M-step, the updates for $\mu_1$, $\mu_2$, $\Sigma_1$, and $\Sigma_2$ are the same as in PPMF, i.e.,

$\mu_1 = \frac{1}{N}\sum_{i=1}^{N}\lambda_{1i} \quad (8.16)$

$\Sigma_1 = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{diag}(\nu_{1i}^2) + (\lambda_{1i}-\mu_1)(\lambda_{1i}-\mu_1)^T\right) \quad (8.17)$

$\sigma^2 = \frac{1}{S}\sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}\left(x_{ij}^2 + \lambda_{1i}^T\mathrm{diag}(\nu_{2j}^2)\lambda_{1i} + \lambda_{2j}^T\mathrm{diag}(\nu_{1i}^2)\lambda_{2j} - 2x_{ij}\lambda_{1i}^T\lambda_{2j} + (\lambda_{1i}^T\lambda_{2j})^2 + \mathrm{Tr}(\mathrm{diag}(\nu_{1i}^2)\mathrm{diag}(\nu_{2j}^2))\right), \quad (8.18)$

and similarly for $\mu_2$ and $\Sigma_2$. For $\beta$, we have

$\beta_{db} \propto \sum_{j=1}^{M}\phi_{jd}\sum_{l=1}^{G_j} 1(f_{jl} = b). \quad (8.19)$

For each E-step, the time complexity for updating $\phi$ is $O(DMGBt_E)$, where $G = \max\{G_j, [j]_1^M\}$ and $t_E$ is the number of iterations in the E-step, and the complexity for updating the remaining variational parameters is $O(D^2(DM + DN + NM)t_E)$. For each M-step, the time complexity is $O(DMGB)$ for updating $\beta$ and $O(D^2N + D^2M + DMN)$ for updating the other model parameters. The prediction of each missing entry is the same as in PPMF, i.e., $\hat{x}_{ij} = \lambda_{1i}^T\lambda_{2j}$.

As with residual PPMF, residual models also apply to PMFF. The main idea is to generate each rating $x_{ij}$ from $\mathcal{N}(u_i^T v_j + f_i + g_j, \sigma^2)$. While doing inference, a variational Gaussian $\mathcal{N}(\theta_{1i}, \eta_{1i}^2)$ for the row bias $f_i$ and $\mathcal{N}(\theta_{2j}, \eta_{2j}^2)$ for the column bias $g_j$ are also needed.
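To make the PMFF-specific updates concrete, here is a small sketch of the $\phi$ update in (8.15) and the $\beta$ update in (8.19). It is our own illustrative code: words are assumed to be represented by dictionary indices, `log_beta` is a $D \times B$ matrix of log topic-word probabilities, and `phi` is an $M \times D$ matrix whose $j$-th row is $\phi_j$.

```python
import numpy as np

def update_phi(lam2_j, log_beta, words_j):
    """Variational update for phi_j, following (8.15):
    phi_jd proportional to exp(lambda_2jd + mean over words of log beta_{d, f_jl})."""
    log_phi = lam2_j + log_beta[:, words_j].mean(axis=1)   # shape (D,)
    log_phi -= log_phi.max()                               # stabilize before exponentiating
    phi = np.exp(log_phi)
    return phi / phi.sum()

def update_beta(phi, docs, B):
    """M-step update for beta, following (8.19):
    beta_db proportional to sum_j phi_jd * (count of word b in f_j)."""
    D = phi.shape[1]
    beta = np.zeros((D, B))
    for j, words_j in enumerate(docs):
        counts = np.bincount(words_j, minlength=B)
        beta += np.outer(phi[j], counts)
    return beta / beta.sum(axis=1, keepdims=True)
```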

8.2 Experimental Results

In this section, we use the same MovieLens data set as in Chapter 7. It contains 1 million ratings for 3900 movies by 6040 users. We also extract three types of side information for movies from IMDB: cast, genre, and plot. For genre, there are 25 movie types. For cast, we only use the top-10 ranked most important actors/actresses in each movie. For plot, we use the plots written by IMDB users, and there are 2693 words in the dictionary after preprocessing. We then remove the movies with one or more types of side information missing. As in Chapter 7, we convert each rating entry $x_{ij}$ to $6 - x_{ij}$, use mean square error (MSE) for measurement, and run 10-fold cross validation with early stopping for each algorithm.

Table 8.1: MSE for PMFF compared to PPMF, for D from 5 to 30. (a) Original models: PPMF, PMFF cast, PMFF plot, PMFF genre. (b) Residual models: rsPPMF, rsPMFF cast, rsPMFF plot, rsPMFF genre.

To see whether incorporating side information helps improve the accuracy, we show the results for PMFF with the latent dimension $D$ increasing from 5 to 30 in Table 8.1, where Table 8.1(a) gives the results for the original models and Table 8.1(b) the results for the residual models (we put "rs" before a model name to denote the residual model). We mark a result of PMFF if its MSE is lower than that of the corresponding PPMF, which

does not use side information. Also, for each choice of $D$, the best result is shown in bold. In general, incorporating side information in PMFF helps increase the prediction accuracy, for both PMFF and rsPMFF. Among the three types of side information, genre seems to be the most informative, followed by cast, while plots hurt the results more than they help. As for the reasons for the poor performance of plots: the plots are quite subjective and highly compressed, and two movies with similar plots may be completely different in quality. As for the cast, it may help prediction if it contains famous movie stars, but for most actors/actresses, whether he or she appears in a movie does not seem to affect the rating that much. Also, most actors/actresses only appear in one or two movies, making it difficult to discover the relationship between a particular actor/actress and the movie ratings. In comparison, there are only 25 movie types in genre, so a large number of movies are assigned to the same movie type, making it easier to find the relationship between the rating and the genre, which could be one reason for genre's usefulness in prediction. However, intuitively, genre is not that informative: a movie will not necessarily get a high or low rating just because it belongs to a certain type. For all the reasons above, although we expected better performance, the side information we have used may not be powerful enough to generate a distinct improvement in accuracy, at least through the ways we have considered.

8.3 Conclusion

In this chapter, we have discussed PMFF, a probabilistic matrix factorization model which incorporates discrete features for rows and columns as side information. It applies PPMF to the movie rating matrix and CTM to the movies' side information, and combines PPMF and CTM by sharing latent variables, i.e., the latent features for movies in PPMF become the mixed-membership vectors in CTM. Such coupling between PPMF and CTM allows the side information to affect the factorization results from PPMF. The idea of residual models in PPMF also applies to PMFF. In the experiments, we use three types of features for movies in movie rating data: cast, plot, and genre. The experiments show that by incorporating proper side information, PMFF outperforms PPMF in missing value prediction to a certain extent.

Chapter 9

Kernelized Probabilistic Matrix Factorization

In Chapter 8, we discussed incorporating feature vectors as side information into probabilistic matrix factorization for missing value prediction. There is another type of side information: a graph, such as the users' social network associated with a movie rating matrix. The intuition is that users connected with each other in a social network probably have similar interests, which can help to predict the missing ratings. In this chapter, we propose kernelized probabilistic matrix factorization (KPMF), which incorporates a graph into matrix factorization. In KPMF, the latent matrix is assumed to be sampled from zero-mean Gaussian processes (GP). The covariance functions of the GPs are derived from graph kernel functions, and encode the covariance structure across rows and across columns respectively. In principle, the KPMF model is also applicable to side information in the form of feature vectors, by choosing appropriate kernel functions such as the RBF kernel [111] instead of graph kernel functions, but we only discuss incorporating a graph in this chapter.

A key difference between KPMF and probabilistic matrix factorization (PMF) [107] or Bayesian probabilistic matrix factorization (BPMF) [108] is as follows: while PMF assumes an independent latent vector for each row, KPMF works with latent vectors spanning all rows. Therefore, unlike PMF, KPMF is able to explicitly capture the covariances across the rows. Moreover, if an entire row of the data matrix is missing,

PMF fails to make a prediction for that row. In contrast, being a nonparametric model based on a covariance function, KPMF can still make predictions based on the row covariances alone. Similarly, the above argument holds for columns as well. Such differences make KPMF a more powerful model than PMF/BPMF. We demonstrate KPMF through two applications: 1) recommender systems and 2) image restoration. For recommender systems, the side information is the users' social network, and for image restoration, the side information is derived from the spatial smoothness assumption: pixel variation in a small neighborhood tends to be small and correlated. Our experiments show that KPMF consistently outperforms state-of-the-art collaborative filtering algorithms, and produces promising results for image restoration.

Figure 9.1: Graphical model for KPMF.

9.1 Kernelized Probabilistic Matrix Factorization

Before we present KPMF, we need to clarify some notation. In previous chapters, we used $u_i \in \mathbb{R}^D$ to denote the latent vector corresponding to row $i$, i.e., $u_i$ is the $i$-th column of the latent matrix $U \in \mathbb{R}^{D\times N}$; similarly for $v_j$. In this chapter, we slightly change the notation for ease of exposition: we use $U_i$ to denote the $i$-th column of $U$ (previously $u_i$), and $U_d$ to denote the $d$-th row of $U$. Similarly, we use $V_j$ to denote the $j$-th column of $V \in \mathbb{R}^{D\times M}$ and $V_d$ to denote the $d$-th row of $V$.

In KPMF, the prior distribution of $U_d$ and $V_d$ is a zero-mean Gaussian process [104]. Gaussian processes are a generalization of the multivariate Gaussian distribution. While a multivariate Gaussian is determined by a mean vector and a covariance matrix, a Gaussian process $GP(m(x), k(x, x'))$ is determined by a mean function $m(x)$ and a covariance function $k(x, x')$. In our problem, $x$ is an index of matrix rows (or columns). Without loss of generality, let $m(x) = 0$, and let $k(x, x')$ denote the corresponding kernel function, which specifies the covariance between any pair of rows (or columns). Also, let $K_1 \in \mathbb{R}^{N\times N}$ and $K_2 \in \mathbb{R}^{M\times M}$ denote the full covariance matrices for the rows and the columns of $X$ respectively. As we shall see later, using $K_1$ and $K_2$ in the priors forces the latent factorization to capture the underlying covariances among rows and among columns simultaneously.

Assuming $K_1$ and $K_2$ are known,¹ the generative process for KPMF is given as follows (also see Figure 9.1):

1. Generate $U_d \sim GP(0, K_1)$, $[d]_1^D$, where $GP$ stands for Gaussian process.
2. Generate $V_d \sim GP(0, K_2)$, $[d]_1^D$.
3. For each non-missing entry $x_{ij}$, generate $x_{ij} \sim \mathcal{N}(U_i^T V_j, \sigma^2)$.

The likelihood over the observed entries in the target matrix $X$ given the latent matrices $U$ and $V$ is

$p(X \mid U, V, \sigma^2) = \prod_{i=1}^{N}\prod_{j=1}^{M}\left[\mathcal{N}(x_{ij} \mid U_i^T V_j, \sigma^2)\right]^{\delta_{ij}}, \quad (9.1)$

where $\delta_{ij}$ is the indicator taking value 1 when $x_{ij}$ is non-missing and 0 otherwise. The priors over $U$ and $V$ are given by

$p(U \mid K_1) = \prod_{d=1}^{D} GP(U_d \mid 0, K_1), \quad (9.2)$

$p(V \mid K_2) = \prod_{d=1}^{D} GP(V_d \mid 0, K_2). \quad (9.3)$

¹ The choice of $K_1$ and $K_2$ depends on the specific problem domain, and will be addressed later with particular examples.
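Since the row (column) index set is finite here, each GP draw reduces to a multivariate Gaussian with covariance $K_1$ ($K_2$). The following is a minimal sketch of the generative process above, using our own function names and treating every entry as observed for simplicity.

```python
import numpy as np

def sample_kpmf(K1, K2, D, sigma_sq, seed=0):
    """Minimal sketch of the KPMF generative process: each latent factor U_d (length N)
    is drawn with covariance K1, each V_d (length M) with covariance K2, and
    observations follow N(U_i^T V_j, sigma^2)."""
    rng = np.random.default_rng(seed)
    N, M = K1.shape[0], K2.shape[0]
    U = rng.multivariate_normal(np.zeros(N), K1, size=D)   # D x N, rows are U_d
    V = rng.multivariate_normal(np.zeros(M), K2, size=D)   # D x M, rows are V_d
    X = rng.normal(U.T @ V, np.sqrt(sigma_sq))             # x_ij ~ N(U_i^T V_j, sigma^2)
    return X, U, V
```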

For simplicity, we denote $K_1^{-1}$ by $A_1$, and $K_2^{-1}$ by $A_2$. The log-posterior over $U$ and $V$ is hence given by

$\log p(U, V \mid X, \sigma^2, K_1, K_2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}(x_{ij} - U_i^T V_j)^2 - \frac{1}{2}\sum_{d=1}^{D} U_d A_1 U_d^T - \frac{1}{2}\sum_{d=1}^{D} V_d A_2 V_d^T - \frac{S}{2}\log\sigma^2 - \frac{D}{2}(\log|K_1| + \log|K_2|) + C, \quad (9.4)$

where $S$ is the total number of non-missing entries in $X$, $|K|$ is the determinant of $K$, and $C$ is a constant term not depending on the latent matrices $U$ and $V$.

Figure 9.2: (a) $U^T$ is sampled in a row-wise manner in PMF/BPMF; (b) $U^T$ is sampled in a column-wise manner in KPMF.

9.1.1 KPMF Versus PMF/BPMF

We illustrate the difference between KPMF and PMF/BPMF in Figure 9.2. In PMF, $U^T$ is sampled in a row-wise manner (Figure 9.2(a)), i.e., $U_i \in \mathbb{R}^D$ is sampled for each row of $X$. The $\{U_i, [i]_1^N\}$ are hence conditionally independent given the prior. As a result, correlations among rows are not captured in the model. In contrast, in KPMF, $U^T$ is sampled in a column-wise manner (Figure 9.2(b)), i.e., for each of the $D$ latent factors, $U_d \in \mathbb{R}^N$ is sampled for all rows of $X$. In particular, $U_d$ is sampled from a Gaussian process whose covariance $K_1$ captures the row correlations. In this way, during training, the latent factors of each row ($U_i$) are correlated with those of all the other

rows ($U_{i'}$ for $i' \neq i$) through $K_1$. Roughly speaking, if two rows share some similarity based on $K_1$, the corresponding latent factors would also be similar after training. In this chapter, $K_1$ is derived from a graph kernel function given a graph among the rows, so $U_i$ and $U_{i'}$ would be similar if $i$ and $i'$ are connected in the graph. Similarly, the discussion above also applies to $V$ and the columns of $X$.

The difference between KPMF and BPMF is subtle, but they are entirely different models and cannot be viewed as special cases of each other. PMF is a simple special case of both models, with neither correlations across the rows nor correlations across latent factors captured. The row and column independencies in PMF and BPMF significantly undermine the power of the model, since strong correlations among rows and/or among columns are often present in real scenarios. For instance, in recommender systems, users' decisions on item ratings (represented by rows) are very likely to be influenced by other users who have social connections (friends, families, etc.) with them. PMF and BPMF fail to capture such correlational dependencies. As a result, the proposed KPMF performs considerably better than PMF and BPMF (see Chapter 9.2).

9.1.2 Gradient Descent for KPMF

We perform a MAP estimate to learn the latent matrices $U$ and $V$, which maximizes the log-posterior in (9.4) and is equivalent to minimizing the following objective function:

$E = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}(x_{ij} - U_i^T V_j)^2 + \frac{1}{2}\sum_{d=1}^{D} U_d A_1 U_d^T + \frac{1}{2}\sum_{d=1}^{D} V_d A_2 V_d^T. \quad (9.5)$

Minimization of $E$ can be done through gradient descent. In particular, the gradients are given by

$\frac{\partial E}{\partial u_{di}} = -\frac{1}{\sigma^2}\sum_{j=1}^{M}\delta_{ij}(x_{ij} - U_i^T V_j)v_{dj} + e_{(i)}^T A_1 U_d^T, \quad (9.6)$

$\frac{\partial E}{\partial v_{dj}} = -\frac{1}{\sigma^2}\sum_{i=1}^{N}\delta_{ij}(x_{ij} - U_i^T V_j)u_{di} + e_{(j)}^T A_2 V_d^T, \quad (9.7)$

where $e_{(i)}$ denotes an $N$-dimensional unit vector with the $i$-th component being one and the others being zero. The update equations for $U$ and $V$ are

$u_{di}^{(t+1)} = u_{di}^{(t)} - \eta\frac{\partial E}{\partial u_{di}}, \quad (9.8)$

$v_{dj}^{(t+1)} = v_{dj}^{(t)} - \eta\frac{\partial E}{\partial v_{dj}}, \quad (9.9)$

where $\eta$ is the learning rate. The algorithm updates $U$ and $V$ following (9.8) and (9.9) alternately until convergence. It should be noted that since $K_1$ and $K_2$ remain fixed throughout all iterations, $A_1$ and $A_2$ need to be computed only once at initialization.

Now suppose an entire row or column of $X$ is missing. While PMF and BPMF fail to address such a problem, KPMF still works if appropriate side information is given. In this case, the update equations in (9.8) and (9.9) become

$u_{di}^{(t+1)} = u_{di}^{(t)} - \eta\, e_{(i)}^T A_1 U_d^T = u_{di}^{(t)} - \eta\sum_{i'=1}^{N} A_1(i, i')u_{di'}, \quad (9.10)$

$v_{dj}^{(t+1)} = v_{dj}^{(t)} - \eta\, e_{(j)}^T A_2 V_d^T = v_{dj}^{(t)} - \eta\sum_{j'=1}^{M} A_2(j, j')v_{dj'}. \quad (9.11)$

In this case, the update of the corresponding $U_i$ is based on a weighted average of the current $U$ over all rows, including the rows that are entirely missing and those that are not, and the weights $A_1(i, i')$ reflect the correlation between the current row $i$ and the rest. The same holds for $V$ and the columns.

9.1.3 Stochastic Gradient Descent for KPMF

As stochastic gradient descent (SGD) usually converges much faster than gradient descent, we also derive the SGD update equations for KPMF below. The objective function in (9.5) can be rewritten as

$E = \frac{1}{\sigma^2}\sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}(x_{ij} - U_i^T V_j)^2 + \mathrm{Tr}(UA_1U^T) + \mathrm{Tr}(VA_2V^T), \quad (9.12)$

where $\mathrm{Tr}(X)$ denotes the trace of matrix $X$. Moreover,

$\mathrm{Tr}(UA_1U^T) = \mathrm{Tr}(U^TUA_1) = \sum_{i=1}^{N}\sum_{i'=1}^{N} A_1(i, i')\, U_i^T U_{i'},$

and similarly,

$\mathrm{Tr}(VA_2V^T) = \sum_{j=1}^{M}\sum_{j'=1}^{M} A_2(j, j')\, V_j^T V_{j'}.$

Therefore, (9.12) becomes

$E = \sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}\left[\frac{1}{\sigma^2}(x_{ij} - U_i^T V_j)^2 + \frac{1}{\tilde{M}_i} U_i^T\sum_{i'=1}^{N} A_1(i, i')U_{i'} + \frac{1}{\tilde{N}_j} V_j^T\sum_{j'=1}^{M} A_2(j, j')V_{j'}\right] = \sum_{i=1}^{N}\sum_{j=1}^{M}\delta_{ij}E_{ij},$

where $\tilde{M}_i$ is the number of non-missing entries in row $i$ and $\tilde{N}_j$ is the number of non-missing entries in column $j$. Finally, for each non-missing entry $(i, j)$, taking the gradient of $E_{ij}$ with respect to $U_i$ and $V_j$ gives:

$\frac{\partial E_{ij}}{\partial U_i} = -\frac{2}{\sigma^2}(x_{ij} - U_i^T V_j)V_j + \frac{1}{\tilde{M}_i}\left[\sum_{i'=1}^{N} A_1(i, i')U_{i'} + A_1(i, i)U_i\right],$

$\frac{\partial E_{ij}}{\partial V_j} = -\frac{2}{\sigma^2}(x_{ij} - U_i^T V_j)U_i + \frac{1}{\tilde{N}_j}\left[\sum_{j'=1}^{M} A_2(j, j')V_{j'} + A_2(j, j)V_j\right].$

9.1.4 Prediction

Learning based on gradient descent or SGD gives us the estimates of the latent matrices $\hat{U}$ and $\hat{V}$. For any missing entry $x_{ij}$, the maximum-likelihood estimate is the inner product of the corresponding latent vectors, i.e., $\hat{x}_{ij} = \hat{U}_i^T\hat{V}_j$.
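The following sketch shows an SGD training loop built on the per-entry gradients above, together with the prediction rule. Variable names, defaults, and the initialization scale are our own assumptions, not the authors' settings; $A_1 = K_1^{-1}$ and $A_2 = K_2^{-1}$ are assumed precomputed.

```python
import numpy as np

def kpmf_sgd(X, mask, A1, A2, D=10, sigma_sq=1.0, eta=0.01, epochs=50, seed=0):
    """Illustrative SGD loop for KPMF based on the per-entry gradients of E_ij."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    U = 0.1 * rng.standard_normal((D, N))      # columns U_i are row latent vectors
    V = 0.1 * rng.standard_normal((D, M))      # columns V_j are column latent vectors
    obs = np.argwhere(mask > 0)                # observed (i, j) pairs
    M_tilde = mask.sum(axis=1)                 # non-missing entries per row
    N_tilde = mask.sum(axis=0)                 # non-missing entries per column
    for _ in range(epochs):
        rng.shuffle(obs)
        for i, j in obs:
            err = X[i, j] - U[:, i] @ V[:, j]
            grad_Ui = (-(2.0 / sigma_sq) * err * V[:, j]
                       + (U @ A1[i] + A1[i, i] * U[:, i]) / M_tilde[i])
            grad_Vj = (-(2.0 / sigma_sq) * err * U[:, i]
                       + (V @ A2[j] + A2[j, j] * V[:, j]) / N_tilde[j])
            U[:, i] -= eta * grad_Ui
            V[:, j] -= eta * grad_Vj
    return U, V

# Prediction for a missing entry (i, j): x_hat_ij = U[:, i] @ V[:, j]
```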

Figure 9.3: Example input data: (a) social network among 6 users; (b) observed rating matrix for the 6 users on 4 items.

Table 9.1: Number of users, items and ratings of the data sets used.

                 Flixster    Epinion
  # Users        2,000       2,000
  # Items        3,000       3,000
  # Ratings      173,172     60,485
  # Relations    32,548      74,575
  Rating Density 2.89%       1.00%

9.2 Experiments on Recommender Systems

In this section, we evaluate the KPMF model for item recommendation with known user relations. In particular, we are given a user-item rating matrix with missing entries as well as a social network graph among users (see Figure 9.3). The goal is to predict the missing entries in the rating matrix by exploiting both the observed ratings and the underlying rating constraints derived from the social network. We run experiments on two publicly available datasets, Flixster [64] and Epinion [84], and compare the prediction results of KPMF with several other algorithms.

9.2.1 Datasets

Flixster is a social movie website, where users can rate movies and make friends at the same time. The social graph in Flixster is undirected, and the rating values are 10 discrete numbers ranging from 0.5 to 5 in steps of 0.5.

Epinion is a customer review website where users share opinions on various types of items, such as electronic products, companies, and movies, by writing reviews or giving ratings. Each user also maintains a list of people he/she trusts, which forms a social network with trust relationships. Unlike Flixster, the social network in Epinion is a directed graph, but for simplicity we convert the directed edges to undirected ones by keeping only one edge between two users if they are connected in either direction originally. The rating values in Epinion are discrete values ranging from 1 to 5.

For each dataset, we sampled a subset with 2,000 users and 3,000 items. For the purpose of testing our hypothesis (whether the social network could help in making rating predictions), the 2,000 users selected are the users with the most friends in the social network, while the 3,000 items selected are the most frequently rated overall. The statistics of the datasets are given in Table 9.1. Figure 9.4 shows the histograms of the number of past ratings and the number of friends each user has.

Figure 9.4: (a) and (b): Histograms of users' rating frequencies on Flixster and Epinion. (c) and (d): Histograms of the number of friends for each user on Flixster and Epinion.

9.2.2 Graph Kernels

To construct kernel matrices suitable to our problem, we consider the users' social network as an undirected, unweighted graph $G$ with nodes and edges representing users and their connections. Elements of the adjacency matrix of $G$ are given by $W_{i,j} = 1$ if there is an edge between user $i$ and user $j$, and 0 otherwise. The Laplacian matrix [32] of $G$ is defined as $L = \Lambda - W$, where the degree matrix $\Lambda$ is a diagonal matrix with diagonal entries $\lambda_{ii} = \sum_{j=1}^{N} W_{i,j}$ $(i = 1, \ldots, N)$.

Graph kernels provide a way of capturing the intricate structure among nodes in a graph (if instead we are given features or attributes of the users, we could replace graph kernels with polynomial kernels, RBF kernels, etc. [111]). In our case, a graph kernel defines a similarity measure for users' taste in certain items. Generally speaking, users tend to have similar taste to their friends and families, and thus their ratings for the same items would also be correlated. Graph kernels can capture such effects in the social network, and the resulting kernel matrix provides key information about users' rating patterns. In this work, we examine three different graph kernels and refer readers to [71] for more available choices.

Diffusion kernel: The Diffusion kernel proposed in [71] is derived from the idea of the matrix exponential, and has a nice interpretation in terms of a diffusion process of some substance, such as heat. In particular, if we let some substance be injected at node $i$ and flow along the edges of the graph, $K_{DF}(i, j)$ can be regarded as the amount of the substance accumulated at node $j$ in the steady state. The diffusion kernel intuitively captures the global structure among nodes in the graph, and it can be computed as follows:

$K_{DF} = \lim_{n\to\infty}\left(1 - \frac{\beta L}{n}\right)^n = e^{-\beta L}, \quad (9.13)$

where $\beta$ is the bandwidth parameter that determines the extent of diffusion ($\beta = 0$ means no diffusion).

Commute Time (CT) kernel: As proposed in [49], the Commute Time kernel is closely related to the so-called average commute time (the number of steps a random

walker takes to commute between two nodes in a graph), and can be computed using the pseudo-inverse of the Laplacian matrix: $K_{CT} = L^\dagger$. Moreover, since $K_{CT}$ is conditionally positive definite, $K_{CT}(i, j)$ behaves exactly like a Euclidean distance between nodes in the graph [80]. As a consequence, the nodes can be isometrically embedded in a subspace of $\mathbb{R}^n$ ($n$ is the number of nodes), where the Euclidean distance between the points is $K_{CT}(i, j)$.

Regularized Laplacian (RL) kernel: Smola and Kondor [120] introduce a way of performing regularization on graphs that penalizes variation between adjacent nodes. In particular, it turns out that the graph Laplacian can be equally defined as a linear operator on the nodes of the graph, and naturally induces a semi-norm on $\mathbb{R}^n$. This semi-norm quantifies the variation of adjacent nodes, and can be used for designing regularization operators. Furthermore, such regularization operators give rise to a set of graph kernels, among them the Regularized Laplacian kernel:

$K_{RL} = (I + \gamma L)^{-1}, \quad (9.14)$

where $\gamma > 0$ is a constant.

9.2.3 Methodology

Given the social network, we use the Diffusion kernel, the CT kernel, and the RL kernel described above to generate the covariance matrix $K_1$ for users. The parameter settings are as follows: $\beta = 0.01$ for the Diffusion kernel and $\gamma = 0.1$ for the RL kernel, both chosen via validation. Since no side information is available for the items in the Flixster or Epinion dataset, $K_2$ is assumed diagonal: $K_2 = \sigma_v^2 I$, where $I$ is an $M\times M$ identity matrix. Given the covariance matrix $K_1$ generated from the graph kernels, we perform gradient descent (KPMF with stochastic gradient descent is discussed later) on $U$ and $V$ using the update equations (9.6) and (9.7).⁴ At each iteration, we evaluate the root mean square error (RMSE) on the validation set, and terminate training once the RMSE starts increasing, or the maximum number of iterations allowed is reached.

⁴ While gradient descent in KPMF involves the inverse of $K_1$, we actually do not need to invert the matrix when using the CT kernel or the RL kernel, since $K_{CT}^{-1} = L$ and $K_{RL}^{-1} = I + \gamma L$.
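A small sketch of how the three kernels can be computed from the social graph's adjacency matrix is given below. It is our own illustrative code (not the thesis implementation), with the parameter values quoted above used as defaults.

```python
import numpy as np
from scipy.linalg import expm, pinv

def graph_kernels(W, beta=0.01, gamma=0.1):
    """Compute the Diffusion, Commute Time, and Regularized Laplacian kernels
    from an adjacency matrix W, following (9.13), K_CT = L^+, and (9.14)."""
    Lap = np.diag(W.sum(axis=1)) - W                        # Laplacian L = Lambda - W
    K_df = expm(-beta * Lap)                                # Diffusion kernel, (9.13)
    K_ct = pinv(Lap)                                        # Commute Time kernel
    K_rl = np.linalg.inv(np.eye(len(W)) + gamma * Lap)      # Regularized Laplacian, (9.14)
    return K_df, K_ct, K_rl
```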

The learned $U$ and $V$ are then used to predict the ratings in the test set. We run the algorithm with different latent vector dimensions, viz., $D = 5$ and $D = 10$.

Figure 9.5: RMSE for different algorithms on Flixster and Epinion datasets (best viewed in color); panels: (a) Flixster, D=5; (b) Flixster, D=10; (c) Epinion, D=5; (d) Epinion, D=10. Lower is better.

We compare the performance of KPMF (using gradient descent) with three algorithms. The first uses only the information from the social network: to predict the rating $x_{ij}$, we take the neighbors of user $i$ from the social network and average their ratings on item $j$ as the prediction for $x_{ij}$. We denote this method the social network based algorithm (SNB). If none of the neighbors of user $i$ has rated item $j$, SNB cannot predict $x_{ij}$. The second algorithm we compare with is PMF [107], which only uses the information from the rating matrix.⁵

⁵ BPMF actually performs worse than PMF in our experiments, which might be due to improper parameter settings in the code published by the authors. Thus we omit reporting BPMF results here.

The third algorithm is SoRec [82], a state-of-the-art collaborative filtering algorithm that combines information from both

the rating matrix and the social network by performing matrix factorization jointly. In addition, we also compare the computational efficiency of KPMF using gradient descent and KPMF using stochastic gradient descent.

Figure 9.6: Performance improvement of KPMF compared to PMF with different percentages of training data used (best viewed in color); panels: (a) Flixster, D=5; (b) Flixster, D=10; (c) Epinion, D=5; (d) Epinion, D=10. The improvement of KPMF versus PMF decreases as more training data are used. This is because for sparser datasets, PMF has relatively more difficulty learning users' preferences from a smaller number of past ratings, while KPMF can still take advantage of the known social relations among users and utilize the observed ratings better.

Another experiment is prediction for users with no past ratings. We test on 200 users who have the most connections in the social network (so that the effect of the social network is most evident). All past ratings made by these 200 users in the training set are not used for learning. Therefore, the observed rating matrix contains 200 rows of zeros, while the remaining 1,800 rows are unchanged. For SNB, since we cannot predict some entries if none of the corresponding user's neighbors has rated the target item, we also define a measure of

coverage for reference, defined as the percentage of ratings that can be predicted among all the test entries.

Table 9.2: RMSE and coverage of SNB on (a) Flixster and (b) Epinion, with 20%, 40%, 60%, and 80% of the data used for training.

Table 9.3: RMSE on users with no ratings for training (Item Average, KPMF(Diffusion), KPMF(CT), KPMF(RL); D = 5 and D = 10 on Flixster and Epinion), with (a) 20% and (b) 80% of the training data used.

9.2.4 Results

The RMSE on Flixster and Epinion for PMF, SoRec, and KPMF (with different kernels) is given in Figure 9.5. In each plot, we show the results with different numbers of ratings used for training, ranging from 20% to 80% of the whole dataset. The main observations are as follows:

1. KPMF, as well as SoRec, outperforms PMF on both Flixster and Epinion, regardless of the number of ratings used for training. While KPMF and SoRec use both the social network and the rating matrix for training, PMF uses the rating matrix alone. The performance improvement of KPMF and SoRec over PMF suggests that the social network is indeed playing a role in helping predict the ratings. In addition, KPMF also outperforms SoRec in most cases.

2. Figure 9.6 shows KPMF's percentage of improvement over PMF in terms of RMSE. We can see that the performance gain of KPMF increases as the training data gets sparser. This implies that when the information from the rating matrix becomes weaker, the users' social network becomes more useful for prediction.

3. As shown in Figure 9.5, among the three graph kernels examined, the CT kernel leads to the lowest RMSE on both Flixster and Epinion. The advantage is more obvious on Flixster than on Epinion.

4. We also give the RMSE of SNB in Table 9.2 for reference. The RMSE of this simple baseline algorithm is much higher than that of the other algorithms. The coverage is low with a sparse training matrix, but gets higher as the sparsity decreases.

Table 9.3 shows the results of the experiment on prediction for users with no past ratings. The RMSE is computed over the selected 200 users who have the most connections in the social network, whose past ratings are not utilized during training. For contrast, we only show results on the datasets with 20% and 80% training data. KPMF consistently outperforms Item Average⁶ by a large margin when 20% of the training data are used, but the advantage is not so obvious for the dataset with 80% training data (note that Item Average actually outperforms KPMF on Epinion for the Diffusion and RL kernels). This result again implies that the side information from the users' social network is more valuable when the observed rating matrix is sparse, and such sparsity is indeed often encountered in real data [110].

Finally, we compare the computational efficiency of KPMF with stochastic gradient descent (KPMF$_{SGD}$) and KPMF with gradient descent (KPMF$_{GD}$). Table 9.4 shows the RMSE and running time for the two, where we set D = 10 and use p% = 20% of the ratings for training. Although KPMF$_{SGD}$ has slightly higher RMSE than KPMF$_{GD}$, it is hundreds of times faster. Similar results are also observed in experiments with other choices of D and p%. Therefore, for large scale datasets in real applications, KPMF$_{SGD}$ would be a better choice.

⁶ The algorithm that predicts the missing rating for an item as the average of its observed ratings by other users.

Table 9.4: Comparison of RMSE and running time (sec) for KPMF$_{GD}$ and KPMF$_{SGD}$ on Flixster and Epinion. KPMF$_{SGD}$ is slightly worse than KPMF$_{GD}$ in terms of RMSE, but significantly faster.

Figure 9.7: The graph constructed for the rows of the image using the spatial smoothness property. (Top: $\Delta = 1$; Bottom: $\Delta = 2$.)

9.3 Experiments on Image Restoration

In this section, we demonstrate the use of KPMF in image restoration to further illustrate the broad potential of the proposed framework and the relevance of incorporating side information. Image restoration is the process of recovering corrupted regions of a target image [17]. Let us denote the $N\times M$ target image by $P$, and the corrupted region to be recovered by $\Omega$ (see the black scribbles in the second column of Figure 9.8). The task is to fill in the pixels inside $\Omega$ in a way that is coherent with the known pixels outside $\Omega$, i.e., $P\setminus\Omega$. One might notice that this problem is quite similar to the one faced in recommender systems, where the rating matrix $X$ becomes $P$, ratings become pixel values, and the missing entries become $\Omega$. Therefore, if we consider the rows of the image as users and the columns as items, we can apply the KPMF algorithm to fill in the pixels in $\Omega$ just as we predict missing entries in recommender systems.

However, since no direct information on the correlations among the rows or columns of the image is given, the difficulty lies in obtaining proper kernel matrices. One way to address this is to construct a graph for images in analogy to the users' social network for recommender systems. To do that, we consider the spatial smoothness of an image (while this will be used here to illustrate the proposed framework, we can consider graphs derived from other attributes as well, e.g., smoothness in

feature space, with features derived from local patches or texture-type multiscale analysis [41]). Below we describe how to construct the graph for the rows using this property (the graph for the columns can be constructed in a similar fashion). First, we assume that each row is similar to its neighboring rows and thus directly connected to them in the graph. Let $r_i$ $(i = 1, \ldots, N)$ be the node in the graph that represents the $i$-th row of the image (nodes representing parts of the image rows could be considered as well to further localize the structure). Then there exists an edge between $r_i$ and $r_j$ $(j \neq i)$ if and only if $|i - j| \leq \Delta$, where $\Delta$ is a constant that determines the degree of $r_i$ (see Figure 9.7). The corresponding adjacency matrix of the graph is a band matrix with 1's confined to the diagonal band and 0's elsewhere. Given the graphs for the rows and the columns, we can obtain their corresponding kernel matrices by applying the graph kernels from Chapter 9.2.2.

Since each color image is composed of three channels (Red, Green and Blue), the KPMF update equations for learning the latent matrices are applied to each channel independently. Finally, the estimates for $\Omega$ from the three channels, along with the known pixels in $P\setminus\Omega$, are combined to form the restored image. Restoration results using KPMF on several corrupted images are shown in Figure 9.8. We can see that KPMF does an excellent job on the relatively easy tasks in the first two rows, and a satisfactory job on the highly challenging tasks in the last three rows.

9.4 Conclusion

In this chapter, we have discussed KPMF, which incorporates a graph into probabilistic matrix factorization for missing value prediction by exploiting the underlying covariances among rows and among columns encoded in the graph. KPMF introduces Gaussian process priors for the latent matrices in the generative model, which forces the learned latent matrices to respect the covariance structure among rows and among columns, enabling the incorporation of side information when learning the model. As demonstrated in the experiments, this characteristic can play a critical role in boosting model performance, especially when the observed data matrix is sparse.

Another advantage of KPMF over PMF and BPMF is its ability to predict even when an entire row/column of the data matrix is missing, as long as appropriate side

information is available. In principle, KPMF is applicable to general matrix completion problems, but in this chapter we focus on two specific applications: recommender systems and image restoration. In the future, we would like to generalize the current model to handle the case of weighted entries, where different entries are assigned different weights according to some pre-defined criteria.

Figure 9.8: Image restoration results using KPMF (best viewed in color). From left to right: original images, corrupted images (regions to be restored are in black), and images restored using KPMF. For KPMF, $\Delta = 5$ when constructing the row and column graphs, and the Diffusion kernel with $\beta = 0.5$ is used to obtain the kernel matrices.


More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Content-based Recommendation

Content-based Recommendation Content-based Recommendation Suthee Chaidaroon June 13, 2016 Contents 1 Introduction 1 1.1 Matrix Factorization......................... 2 2 slda 2 2.1 Model................................. 3 3 flda 3

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

Predictive Discrete Latent Factor Models for large incomplete dyadic data

Predictive Discrete Latent Factor Models for large incomplete dyadic data Predictive Discrete Latent Factor Models for large incomplete dyadic data Deepak Agarwal, Srujana Merugu, Abhishek Agarwal Y! Research MMDS Workshop, Stanford University 6/25/2008 Agenda Motivating applications

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Generative Models for Discrete Data

Generative Models for Discrete Data Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians. Sargur Srihari Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Gaussian Models

Gaussian Models Gaussian Models ddebarr@uw.edu 2016-04-28 Agenda Introduction Gaussian Discriminant Analysis Inference Linear Gaussian Systems The Wishart Distribution Inferring Parameters Introduction Gaussian Density

More information

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2

STATS 306B: Unsupervised Learning Spring Lecture 2 April 2 STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised

More information

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems Matrix Factorization Techniques for Recommender Systems Patrick Seemann, December 16 th, 2014 16.12.2014 Fachbereich Informatik Recommender Systems Seminar Patrick Seemann Topics Intro New-User / New-Item

More information

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task Binary Principal Component Analysis in the Netflix Collaborative Filtering Task László Kozma, Alexander Ilin, Tapani Raiko first.last@tkk.fi Helsinki University of Technology Adaptive Informatics Research

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)

Regression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features

More information

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model

Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Kai Yu Joint work with John Lafferty and Shenghuo Zhu NEC Laboratories America, Carnegie Mellon University First Prev Page

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003. Following slides borrowed ant then heavily modified from: Jonathan Huang

More information

Mixed Membership Matrix Factorization

Mixed Membership Matrix Factorization Mixed Membership Matrix Factorization Lester Mackey University of California, Berkeley Collaborators: David Weiss, University of Pennsylvania Michael I. Jordan, University of California, Berkeley 2011

More information

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up

Topic Models. Brandon Malone. February 20, Latent Dirichlet Allocation Success Stories Wrap-up Much of this material is adapted from Blei 2003. Many of the images were taken from the Internet February 20, 2014 Suppose we have a large number of books. Each is about several unknown topics. How can

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Mixed Membership Matrix Factorization

Mixed Membership Matrix Factorization Mixed Membership Matrix Factorization Lester Mackey 1 David Weiss 2 Michael I. Jordan 1 1 University of California, Berkeley 2 University of Pennsylvania International Conference on Machine Learning, 2010

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

Latent Variable Models and EM Algorithm

Latent Variable Models and EM Algorithm SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

More information

Clustering. Léon Bottou COS 424 3/4/2010. NEC Labs America

Clustering. Léon Bottou COS 424 3/4/2010. NEC Labs America Clustering Léon Bottou NEC Labs America COS 424 3/4/2010 Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification, clustering, regression, other.

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Decoupled Collaborative Ranking

Decoupled Collaborative Ranking Decoupled Collaborative Ranking Jun Hu, Ping Li April 24, 2017 Jun Hu, Ping Li WWW2017 April 24, 2017 1 / 36 Recommender Systems Recommendation system is an information filtering technique, which provides

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Modeling User Rating Profiles For Collaborative Filtering

Modeling User Rating Profiles For Collaborative Filtering Modeling User Rating Profiles For Collaborative Filtering Benjamin Marlin Department of Computer Science University of Toronto Toronto, ON, M5S 3H5, CANADA marlin@cs.toronto.edu Abstract In this paper

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Online Dictionary Learning with Group Structure Inducing Norms

Online Dictionary Learning with Group Structure Inducing Norms Online Dictionary Learning with Group Structure Inducing Norms Zoltán Szabó 1, Barnabás Póczos 2, András Lőrincz 1 1 Eötvös Loránd University, Budapest, Hungary 2 Carnegie Mellon University, Pittsburgh,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

UNIVERSITY OF CALIFORNIA, IRVINE. Networks of Mixture Blocks for Non Parametric Bayesian Models with Applications DISSERTATION

UNIVERSITY OF CALIFORNIA, IRVINE. Networks of Mixture Blocks for Non Parametric Bayesian Models with Applications DISSERTATION UNIVERSITY OF CALIFORNIA, IRVINE Networks of Mixture Blocks for Non Parametric Bayesian Models with Applications DISSERTATION submitted in partial satisfaction of the requirements for the degree of DOCTOR

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

Stochastic Processes, Kernel Regression, Infinite Mixture Models

Stochastic Processes, Kernel Regression, Infinite Mixture Models Stochastic Processes, Kernel Regression, Infinite Mixture Models Gabriel Huang (TA for Simon Lacoste-Julien) IFT 6269 : Probabilistic Graphical Models - Fall 2018 Stochastic Process = Random Function 2

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Feature Engineering, Model Evaluations

Feature Engineering, Model Evaluations Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering

More information

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,

More information

Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

More information

Clustering based tensor decomposition

Clustering based tensor decomposition Clustering based tensor decomposition Huan He huan.he@emory.edu Shihua Wang shihua.wang@emory.edu Emory University November 29, 2017 (Huan)(Shihua) (Emory University) Clustering based tensor decomposition

More information

Expectation-Maximization

Expectation-Maximization Expectation-Maximization Léon Bottou NEC Labs America COS 424 3/9/2010 Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification, clustering, regression,

More information

CS Lecture 18. Topic Models and LDA

CS Lecture 18. Topic Models and LDA CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Introduction to Logistic Regression

Introduction to Logistic Regression Introduction to Logistic Regression Guy Lebanon Binary Classification Binary classification is the most basic task in machine learning, and yet the most frequent. Binary classifiers often serve as the

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017

Sum-Product Networks. STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Sum-Product Networks STAT946 Deep Learning Guest Lecture by Pascal Poupart University of Waterloo October 17, 2017 Introduction Outline What is a Sum-Product Network? Inference Applications In more depth

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information