A Variational Approach to Semi-Supervised Clustering
Peng Li, Yiming Ying and Colin Campbell
Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, United Kingdom

Abstract

We present a variational inference scheme for semi-supervised clustering in which data is supplemented with side information in the form of common labels. There is no mutual exclusion of classes assumption and samples are represented as a combinatorial mixture over multiple clusters. The method has other advantages such as the ability to find the most probable number of soft clusters in the data, the ability to avoid overfitting and the ability to find a confidence measure for membership of a soft cluster. We illustrate the performance of the method on six data sets and find a positive comparison against constrained K-means clustering.

1 Introduction

With semi-supervised clustering the aim is to find clusters or meaningful partitions of the data, aided by the use of side information. This side information can be in the form of must-links (two samples must be in the same cluster) and cannot-links (two samples must not belong to the same cluster). A number of approaches have been proposed for semi-supervised clustering. Typically these methods have used the constraints from the side information to either alter the objective function or the distance measure used. Many methods have used modifications of pre-existing clustering schemes such as incremental clustering algorithms [13], K-means clustering [2, 7] or Hidden Markov Random Fields [3]. One problem with some of these approaches is that the method can work well only if the correct number of clusters is already known: K-means clustering is an example. However, a principled approach to finding K is generally not given. Another potential disadvantage of some clustering approaches is that there is an implicit mutual exclusion of clusters assumption, i.e. samples are assumed to uniquely belong to a particular cluster.
In many cases this assumption may not be fully valid and it would be more appropriate for a sample to be associated with multiple soft clusters. A further issue with some clustering approaches is that there is no principled mechanism to avoid fitting to noise in the data. With Bayesian methods, however, we can incorporate a Bayes prior which penalizes such overfitting.

Our motivation for considering semi-supervised clustering comes from potential applications in cancer informatics. There have been a number of instances where unsupervised learning methods have been applied to cancer microarray data sets, for example, and clinically distinct subtypes of cancer have been indicated, e.g. [1, 9]. However, in some cases a specific causative event is known and thus it is possible to give common labels to some samples. A specific example is acute lymphoblastic leukemia [15]. This disease is known to have a number of subtypes with variable response to chemotherapy. An originating event for some of these subtypes is believed known, sometimes stemming from the creation of fusion genes through genetic rearrangements involving the genes BCR-ABL, E2A-PBX1, TEL-AML1 or rearrangements of the MLL gene. These rearrangements can be detected via FISH (Fluorescent in situ Hybridization), and thus we can assign common labels to certain samples and use semi-supervised clustering to improve the characterization of unlabelled samples. In these cancer applications, the side information is typically in the form of must-links, which will therefore be the focus of this paper.

We thus propose a probabilistic graphical model approach to semi-supervised clustering with the side information given as the addition of must-links between some samples. The proposed method provides an objective measure of the number of clusters present and can readily handle missing values. Samples are represented as combinatorial mixtures over a set of latent processes or soft clusters, so there is no mutual exclusion of clusters assumption. In addition, by allowing a representation overlapping different clusters we can derive a confidence measure for cluster membership: a confidence measure for subtype membership is important in clinical cancer informatics applications. As a Bayesian methodology we can also modify the method to incorporate a Bayes prior penalizing over-complex models which would fit to noise in the data. In the next two sections we propose our semi-supervised probabilistic graphical model, followed by experiments in section 4.

(Current address: Dept. of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom.)

2 The Methods

Our semi-supervised model utilizes the Latent Process Decomposition (LPD) model developed in [5, 12], and hence we will call this variant semi-supervised LPD or SLPD. For the natural numbers we adopt the notation N_m = {1, ..., m} for any m ∈ N. For the data we use d as the sample index, g as the attribute index, and script letters D, G to denote the corresponding number of samples and attributes. The number of clusters is K. The complete data set is E = {E_dg : d ∈ N_D, g ∈ N_G}. This notation stems from our cancer informatics motivation of using the expression value of gene g in sample d. In our semi-supervised setting, we have additional block information C where each block c denotes a set of data points that is known to belong to a single class. In keeping with standard Bayesian models, we also assume that both the blocks and the data points in each block are i.i.d. sampled. Specifically, this side information can be represented by a D × C matrix δ with entries δ_dc defined as follows:

    δ_dc = 1 if data point d is a member of block c, and δ_dc = 0 otherwise.

We can now describe our semi-supervised graphical model. In probabilistic terms, the data set E can be partitioned into K processes (soft clusters) described as follows. For a complete data set, a Dirichlet prior distribution for the distribution of processes is defined by the K-dimensional parameter α.
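To make the block notation concrete, the indicator matrix δ can be built directly from a list of must-link blocks. The following is a minimal illustrative sketch; the function name and the list-of-blocks representation are our own, not from the paper:

```python
def block_matrix(blocks, D):
    """Build the D x C must-link indicator matrix delta.

    blocks: list of C lists; blocks[c] holds the indices of the samples
    known to share a label. Unconstrained samples form singleton blocks,
    so each sample appears in exactly one block (sum_c delta_dc = 1).
    """
    C = len(blocks)
    delta = [[0] * C for _ in range(D)]
    for c, members in enumerate(blocks):
        for d in members:
            delta[d][c] = 1  # delta_dc = 1: sample d is in block c
    return delta

# Three samples: 0 and 2 are must-linked, 1 is unconstrained.
delta = block_matrix([[0, 2], [1]], D=3)  # -> [[1, 0], [0, 1], [1, 0]]
```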
For each known block c, a distribution θ_c over the set of mixture components indexed by k is drawn from a single Dirichlet distribution parameterized by α. Then, for all samples d in block c (i.e. δ_dc = 1), the latent indicator variable Z_dg indicates which process k is chosen, with probability θ_ck, from the common block-specific distribution θ_c. The value E_dg for attribute g in sample d is then drawn from the kth Gaussian with mean μ_gk and deviation σ_gk, denoted N(E_dg | μ_gk, σ_gk). We repeat the above procedure for each block in C. The graphical model is illustrated in Figure 1, which is motivated by Latent Dirichlet Allocation [5].

Figure 1: A graphical model representation of the method proposed in this paper. E_dg denotes the value of attribute g in sample d. μ and σ are model parameters. Z_dg is a hidden variable giving the process index of attribute g in sample d. θ_c gives the mixing over subgroups for sample d in block c, denoted by d ∈ c. The probability of θ_c is given by a Dirichlet distribution with hyper-parameter α.

The model parameters are Θ = (μ, σ, α) and we use the notation d ∈ c to denote sample d in block c. From the graphical model in Figure 1, we can formulate the block-specific joint distribution of the observed data E and the latent variables Z by

    p(E, θ, Z | Θ, C) = ∏_c p(θ_c | α) ∏_{d ∈ c} p(E_d, Z_d | θ_c, Θ),    (1)

where p(θ_c | α) is a Dirichlet distribution defined by

    p(θ_c | α) = (Γ(∑_k α_k) / ∏_k Γ(α_k)) ∏_k θ_ck^{α_k − 1}.

Using the block matrix δ, we further see that
3 p(e, Z θ c, Θ) = ] δc p(z θ c )p(e Z, µ, σ) c = δc g p(z g θ c )N(E g µ g, σ g, Z g )]. (2) For notational simplicity, we regar Z g as a unit-basis vector (Z g,1,..., Z g,k ) which transforms the process latent variable Z g = k to the unique vector Z g given by Z g,k = 1 an Z g,j = 0 for j k. Equivalently, the ranom variable Z g is istribute accoring to a multinomial probability efine by p(z g θ c ) = k θz g,k ck. Hence, the above equation can be rewritten as p(e, θ, Z Θ, C) = p(θ c α) Zg,k δ c θ ck N(E g µ gk, σ gk )]. (3) c g k With these priors, the final ata likelihoo can be obtaine by marginalizing out the latent variables θ an Z := {Z g : N D, g N G } p(e Θ, C) := p(e, θ, Z Θ, C)θ. (4) In particular, we can see from equations (2) an (3) that θ Z p(e Θ, C) = δc p(z g θ c )N(E g µ g, σ g, Z g )] p(θ c α)θ c c θ c Z g = Zg,k θ ck N(E g µ gk, σ gk )] p(θ c α)θ c c θ c Z g, c c,g,k = δc θ ck N(E g µ g, σ g, Z g )] p(θ c α)θ c. c θ c,g k We shoul mention that, without block information (i.e. δ = I D D ), the above equation is the exact likelihoo of Latent Process Decomposition given in 12]. 3 Moel inference an parameter learning We now consier moel inference an parameter estimation uner SLPD. The main inferential goal is to compute the posterior istribution of the hien variables p(θ, Z E, Θ, C). One irect metho is to use Bayes rule p(θ, Z E, Θ, C) = p(e,θ,z Θ,C) p(e Θ,C). This approach is usually intractable since this involves computationally intensive estimation of multi-integrals in the final likelihoo p(e θ, C). In this paper, we rely on variational inference methos 10, 11] which maximize a lower boun on the likelihoo p(e Θ, C) to estimate the moel parameters Θ an approximate p(θ, Z E, Θ, C) in a hypothesis family. One common hypothesis family is the factorize family efine by q(θ, Z γ, Q) = q(θ γ)q(z Q) with variational parameters γ, Q where, in the expression of the istribution q, the epenency on the E, Θ, C is omitte. 
More specifically, in our model we assume that

    q(θ | γ) = ∏_c q(θ_c | γ_c) = ∏_c (Γ(∑_k γ_ck) / ∏_k Γ(γ_ck)) ∏_k θ_ck^{γ_ck − 1},

and

    q(Z | Q) = ∏_{d,g} q(Z_dg | Q_dg) = ∏_{d,g} ∏_k Q_dg,k^{Z_dg,k},

where γ, Q will be set as we describe below. We can lower bound the log-likelihood by applying Jensen's inequality to equation (4):

    log p(E | Θ, C) = log ∫_θ ∑_Z p(E, θ, Z | Θ, C) dθ
                    ≥ L(γ, Q; Θ) := E_q[ log p(E, θ, Z | Θ, C) ] − E_q[ log q(θ, Z | γ, Q) ].    (5)

Consequently we can estimate the variational and model parameters by an alternating coordinate ascent method known as the variational EM algorithm. The details are summarized in the Appendix.
E-step: maximize L with respect to the variational parameters γ, Q to get the updates

    Q_dg,k = N(E_dg | μ_gk, σ_gk) ∏_c exp(δ_dc(Ψ(γ_ck) − Ψ(∑_{k'} γ_ck'))) / ∑_{k''} [ N(E_dg | μ_gk'', σ_gk'') ∏_c exp(δ_dc(Ψ(γ_ck'') − Ψ(∑_{k'} γ_ck'))) ],    (6)

and

    γ_ck = α_k + ∑_{d,g} δ_dc Q_dg,k.    (7)

M-step: maximize L with respect to α, μ and σ to get

    μ_gk = ∑_d Q_dg,k E_dg / ∑_d Q_dg,k,    σ²_gk = ∑_d Q_dg,k (E_dg − μ_gk)² / ∑_d Q_dg,k,    (8)

and the parameter α is found using a Newton-Raphson method as given in the Appendix. The above iterative procedure is run until convergence (plateauing of the lower bound or of the estimated likelihood p(E | Θ, C); see the Appendix for details).

Interpretation of the resultant model is very similar to Latent Process Decomposition [12]. When normalized over k, the parameter γ_ck gives the confidence that a sample belonging to block c (which shares a common label) belongs to soft cluster k. For each soft cluster k the model parameters μ_gk and σ_gk give a density distribution for attribute g over all samples; see [6] for examples of the use of these density estimators in interpreting breast cancer microarray data. If some values of E_dg are missing, we omit the corresponding contributions in the M-step updates and the corresponding Q_dg,k.

The above argument is based on a maximum likelihood approach. We end this section with a comment about the maximum a posteriori (MAP) solution. In this case μ_gk and σ_gk are produced by prior distributions p(μ) = ∏_{gk} p(μ_gk) and p(σ²) = ∏_{gk} p(σ²_gk) with p(μ_gk) ∝ N(0, σ_μ) and p(σ²_gk) ∝ exp(−s / σ²_gk). These priors for penalizing over-complexity are justified in [12]. In this case we only need to change the updates for μ and σ:

    μ_gk = σ²_μ ∑_d Q_dg,k E_dg / (σ²_gk + σ²_μ ∑_d Q_dg,k),    σ²_gk = (∑_d Q_dg,k (E_dg − μ_gk)² + 2s) / ∑_d Q_dg,k.    (9)

The argument leading to the above updates is the same as before except that we replace the likelihood p(E | Θ) by p(E | Θ) p(μ) p(σ); consequently the lower bound is replaced by L(Q; Θ) + log p(μ) + log p(σ).
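A compact vectorized sketch of updates (6)-(8) might look like the following. This assumes numpy and scipy; the variable shapes and helper names are our own, not the authors' implementation:

```python
import numpy as np
from scipy.special import digamma

def norm_pdf(x, mu, sigma):
    """Gaussian density N(x | mu, sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def e_step(E, delta, gamma, alpha, mu, sigma):
    """Variational E-step: updates (6) for Q and (7) for gamma.

    E: (D, G) data; delta: (D, C) block matrix; gamma: (C, K);
    alpha: (K,); mu, sigma: (G, K).
    """
    # E_q[log theta_ck] = digamma(gamma_ck) - digamma(sum_k' gamma_ck')
    elog = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))
    log_w = delta @ elog                                  # (D, K): each sample's block term
    lik = norm_pdf(E[:, :, None], mu[None], sigma[None])  # (D, G, K)
    Q = lik * np.exp(log_w)[:, None, :]
    Q /= Q.sum(axis=2, keepdims=True)                     # normalize over k, eq. (6)
    gamma_new = alpha[None, :] + np.einsum('dc,dgk->ck', delta, Q)  # eq. (7)
    return Q, gamma_new

def m_step(E, Q):
    """M-step updates (8): responsibility-weighted means and deviations."""
    w = Q.sum(axis=0)                                     # (G, K)
    mu = (Q * E[:, :, None]).sum(axis=0) / w
    var = (Q * (E[:, :, None] - mu[None]) ** 2).sum(axis=0) / w
    return mu, np.sqrt(var)
```

In practice the E- and M-steps are alternated until the lower bound (5) plateaus; a production implementation would also work in log space to avoid underflow in the Q update.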
4 Experimental Results

To validate the proposed approach, we investigated the ML solution applied to four datasets from the UCI Repository [8] and two cancer microarray data sets (see Table 1). The four data sets from the UCI Repository have known sample labels and thus we can achieve an objective performance measure. As mentioned in the introduction, our interest in semi-supervised clustering stems from a potential use in cancer informatics, and thus the next two data sets we consider are for cancer. One is a leukemia microarray data set [15]: in this case some labels are known since the causative events are observed genetic translocations or rearrangements. We have only used a subset of the original data with unambiguous sample labels. The second dataset is for lung cancer [4]. Again this consisted of histologically labelled samples derived from squamous cell lung carcinomas (21 samples), adenocarcinomas (139 samples), pulmonary carcinoids (20 samples), small-cell lung carcinomas (6 samples) and normal lung tissue (17 samples). The dimensionality of both cancer datasets was reduced to the 500 features with largest variance.

In the evaluation of the methods given below, we investigate three issues. Firstly, we pursue a comparison of the proposed method with pre-existing semi-supervised clustering methods. As our principal comparison we will use constrained K-means clustering (CKM) [2, 7], since this method has been widely used (we will only compare against constrained K-means clustering with must-links). Secondly, we will consider the possible gains to be made by using sample label information. Finally, we also compare the unsupervised ULPD with semi-supervised SLPD to evaluate the gains made by using side information. We use the Balanced Rand Index (BRI) as the evaluation criterion. Let nTS (nTD) be the true number of pairs of data in the same (different) clusters and nPS (nPD) be the correctly predicted number of pairs of data in the same (different) clusters. BRI is defined as
Table 1: Data sets used in the experimental study (columns: Letter {I, J}, Wine, Iris, Digit {3, 6, 8}, Leukemia, Lung Cancer; rows: Number of Samples, Number of Features, Number of Classes).

Table 2: A comparison of constrained K-means clustering and SLPD. The entries are the BRI (mean ± standard deviation over 100 trials using 3-fold cross validation tests). Hypothesis testing indicates a statistically significant performance gain over CKM.

    Data Sets | UKM (0) | CKM (25%) | CKM (50%) | ULPD (0) | SLPD (25%) | SLPD (50%)
    Letter    | 0.501±  |     ±     |     ±     |    ±     |     ±      |   ±0.039
    Wine      | 0.877±  |     ±     |     ±     |    ±     |     ±      |   ±0.032
    Iris      | 0.824±  |     ±     |     ±     |    ±     |     ±      |   ±0.038
    Digit     | 0.751±  |     ±     |     ±     |    ±     |     ±      |   ±0.045

    BRI = 0.5 (nPS / nTS + nPD / nTD).    (10)

The standard Rand Index is defined as (nPS + nPD) / (nTS + nTD), which favors assigning data points to different clusters [14]. BRI favors matched pairs and mismatched pairs equally, hence it is more suitable for evaluating clustering algorithms.

In Table 2 we tabulate performance for both constrained K-means clustering and SLPD using the BRI. For each study, the whole data set was divided into 3 folds by random partition. One fold of the data set was used as a test set and a subset of the remaining two folds was used for supervised training. Exactly the same sample allocations were used in the evaluation of both constrained K-means clustering and SLPD. For the training set we also imposed a degree of supervision to compare unsupervised learning with semi-supervised clustering at different levels of supervision. The fraction of the data set used for supervision was 0% (unsupervised), 25% and 50%. With random resampling of the data, this 3-fold cross validation procedure was repeated over 100 trials. As observed from Tables 2 and 3, SLPD compares favorably with CKM. In addition, by enforcing a degree of supervision (from 0% to 50%), we observe performance improvements of both SLPD and CKM over LPD and unconstrained K-means clustering (UKM). As a real-life application we tested SLPD on two cancer microarray datasets for leukemia and lung cancer.
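Equation (10) can be computed directly from true and predicted label vectors by enumerating sample pairs. The following is a small sketch of our own (pure Python; the helper name is not from the paper):

```python
from itertools import combinations

def balanced_rand_index(true_labels, pred_labels):
    """Balanced Rand Index, equation (10): the average of the fraction of
    correctly predicted same-cluster pairs and the fraction of correctly
    predicted different-cluster pairs."""
    nTS = nTD = nPS = nPD = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true:
            nTS += 1
            nPS += int(same_pred)   # correctly kept together
        else:
            nTD += 1
            nPD += int(not same_pred)  # correctly kept apart
    return 0.5 * (nPS / nTS + nPD / nTD)

# A clustering that is correct up to a relabelling of clusters scores 1.0.
score = balanced_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # -> 1.0
```

Because only pairwise co-membership is compared, the index is invariant to permutations of the cluster labels, which is what makes it usable for evaluating clustering output.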
We selected these datasets since labels can be reliably assigned to a large proportion of the samples and thus we can evaluate the performance of the proposed method. Thus, for the leukemia dataset, we used a reduced 90-sample, 6-class subset of the data where labels can be assigned in our study (balanced to give 15 samples per class). However, for the full dataset there are a small number of samples with unclear assignment to subtype: this subset would be a good target for semi-supervised clustering, since we would want to use all available information to characterise these samples as defining novel subtypes or as having a relation to known subtypes. The results are tabulated in Table 3, where the values are the averages over 50 runs. For the leukemia dataset, both unsupervised LPD and semi-supervised LPD were trained on 45 samples from 3 classes, with clustering performance evaluated on the other 45 samples from the remaining 3 classes. For leukemia we find that the BRI improves as we increase the extent of supervision. We also find that SLPD consistently outperforms K-means clustering. A similar picture is repeated for lung cancer.

Constrained K-means clustering is faster to execute compared to SLPD. However, SLPD has important advantages over constrained K-means. K-means clustering can work well if K is known, but the method does not have an intrinsic mechanism for determining the correct model complexity. With SLPD we can determine the appropriate number of clusters by determining the estimated log-likelihood on hold-out data in a cross-validation study (see Appendix 6.2). In Figure 2 we illustrate this procedure using the wine data set mentioned earlier. This data set has 3 class labels and 178 samples. To determine the appropriate model complexity we find the average log-likelihood on 28 hold-out samples using the likelihood estimate given in the Appendix (section 6.2; the curves are averages over 100 runs, error bars are omitted to simplify the Figure). For unsupervised learning the peak is at 4 (solid curve).
This result appears reasonable since a dendrogram decomposition of this data set (Figure 2 (right)) suggests this is an appropriate decomposition. If all the class labels are used we get a sharp peak at 3 (upper dashed curve), as expected. However, if we use the labels of one class and leave the other two classes unlabelled we get a shallow peak at 3 (lower dotted curve). This illustrates the point that the class labelling may not necessarily reflect the underlying cluster structure, potentially negating any gains made by using side information.

Table 3: A comparison of constrained K-means clustering and SLPD. The entries are the BRI (mean ± standard deviation over the 50 trials of 3-fold cross validation tests). Hypothesis testing indicates a statistically significant performance gain over CKM.

    Data Sets   | UKM (0) | CKM (25%) | CKM (50%) | ULPD (0) | SLPD (25%) | SLPD (50%)
    Leukemia    | 0.786±  |     ±     |     ±     |    ±     |     ±      |   ±0.048
    Lung Cancer | 0.578±  |     ±     |     ±     |    ±     |     ±      |   ±0.039

Figure 2: The left-hand figure gives the log-likelihood (y-axis) versus K, the number of clusters, for the wine data set for varying degrees of supervision: unsupervised LPD, SLPD with 1 class labelled, and SLPD with all classes labelled (see text). The right-hand figure gives a dendrogram partitioning of the same data set using a correlation distance measure and average linkage (only the upper branches of the dendrogram are illustrated).

5 Conclusion

In this paper we have presented a novel variational inference method for clustering with side information. The method provides a principled approach to model selection via determination of the log-likelihood on hold-out data, as illustrated in Figure 2. By contrast, many earlier methods for clustering with side information appear to lack a sound approach to establishing the correct model complexity. As a Bayesian methodology we could also implement prior beliefs which can improve model performance. Specifically, we can incorporate a Bayes prior penalizing overfitting to noise.
Semi-supervised clustering methods have many important applications: we have mentioned cancer applications where partial labelling of the samples is sometimes possible. Determining the number of subtypes and avoiding a fit to the extensive noise present in many of these data sets are important issues in this application domain.

6 Appendix: derivation of the inference updates for SLPD

6.1 EM-updates

In the E-step, we compute the posterior distribution q(θ, Z | γ, Q) and get the corresponding updates for γ, Q. To this end, we take the functional derivative of q and get

    q(Z | Q) ∝ exp( E_{q(θ | γ)}[ log p(E, θ, Z | Θ, C) ] ),    (11)

and

    q(θ | γ) ∝ exp( E_{q(Z | Q)}[ log p(E, θ, Z | Θ, C) ] ).    (12)
Note that the log joint likelihood can be expressed as

    log p(E, θ, Z | Θ, C) = ∑_{c,k} (α_k − 1) log θ_ck + C(log Γ(∑_k α_k) − ∑_k log Γ(α_k)) + ∑_{c,d,g,k} Z_dg,k δ_dc log θ_ck + ∑_{c,d,g,k} Z_dg,k δ_dc log N(E_dg | μ_gk, σ_gk)
                          = ∑_{c,k} (α_k − 1) log θ_ck + C(log Γ(∑_k α_k) − ∑_k log Γ(α_k)) + ∑_{c,d,g,k} Z_dg,k δ_dc log θ_ck + ∑_{d,g,k} Z_dg,k log N(E_dg | μ_gk, σ_gk),

where we used the fact that ∑_c δ_dc = 1 for any d in the last equation. Note that E_{q(θ_c | γ_c)}[ log θ_ck ] = Ψ(γ_ck) − Ψ(∑_{k'} γ_ck'), where Ψ is the digamma function. Consequently, from the above log joint likelihood we have

    E_{q(θ | γ)}[ log p(E, θ, Z | Θ, C) ] = ∑_{d,g,k} Z_dg,k ( ∑_c δ_dc(Ψ(γ_ck) − Ψ(∑_{k'} γ_ck')) + log N(E_dg | μ_gk, σ_gk) ) + const,

and E_{q(Z | Q)}[ log p(E, θ, Z | Θ, C) ] = ∑_{c,k} log θ_ck ( α_k − 1 + ∑_{d,g} Q_dg,k δ_dc ) + const. Therefore, substituting the above equations into equations (11) and (12) and normalizing Q, we obtain the desired updates (6) and (7) in Section 3.

At the M-step, the updates (8) for μ, σ are easy to get by solving the stationary equations ∂L/∂μ_gk = 0 and ∂L/∂σ²_gk = 0. However, for the parameter α, observe that in its derivative

    ∂L/∂α_i = C(Ψ(∑_k α_k) − Ψ(α_i)) + ∑_c (Ψ(γ_ci) − Ψ(∑_k γ_ck)),

the variable α_i is mixed with the others α_j, j ≠ i, so we have to use an iterative algorithm. Here, we employ a Newton-Raphson method to approximate the optimal α. For this purpose, we compute the Hessian as follows:

    H_ij = C(Ψ'(∑_k α_k) − δ_ij Ψ'(α_i)).

Hence, we have the iterative procedure

    α_new = α_old − ( H(α_old) )^{−1} ∂L/∂α (α_old).

Because of the special form of the Hessian matrix (decomposable into diagonal and off-diagonal terms) we can avoid matrix inversion of H; see the Appendix of [5].

6.2 Computing the likelihood

Once the parameters have been estimated, we can calculate the likelihood using equation (5). By the law of large numbers, we can estimate the expectation with respect to p(θ_c | α) by averaging over a large enough set of N samples:

    p(E | Θ, C) = ∏_c (1/N) ∑_{n=1}^N ∏_{d ∈ c} [ ∏_g ∑_k N(E_dg | μ_gk, σ_gk) θ_ckn ]^{δ_dc},

where {θ_ckn} are N samples drawn from the Dirichlet distribution with parameter α.
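The Newton-Raphson step for α, with the structured Hessian inverted in O(K), might be sketched as follows. This assumes scipy and follows the matrix-inversion trick referenced above from the appendix of [5]; the function name is our own:

```python
import numpy as np
from scipy.special import digamma, polygamma

def alpha_newton_step(alpha, gamma):
    """One Newton-Raphson step for the Dirichlet parameter alpha.

    gamma: (C, K) variational Dirichlet parameters. The Hessian has the
    form H = z*11^T + diag(h) with z = C*psi'(sum_k alpha_k) and
    h_i = -C*psi'(alpha_i), so H^{-1} grad can be computed in O(K)
    without an explicit matrix inversion (Sherman-Morrison).
    """
    C, K = gamma.shape
    # dL/dalpha_i = C(psi(sum alpha) - psi(alpha_i)) + sum_c (psi(gamma_ci) - psi(sum_k gamma_ck))
    grad = C * (digamma(alpha.sum()) - digamma(alpha)) \
        + (digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))).sum(axis=0)
    z = C * polygamma(1, alpha.sum())   # common off-diagonal term
    h = -C * polygamma(1, alpha)        # diagonal correction
    c = (grad / h).sum() / (1.0 / z + (1.0 / h).sum())
    return alpha - (grad - c) / h       # alpha_new = alpha_old - H^{-1} grad
```

At a stationary point the gradient vanishes and the step leaves α unchanged; in practice the step is iterated until the change in α falls below a tolerance.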
Hence

    log p(E | Θ, C) = ∑_c log [ (1/N) ∑_{n=1}^N ∏_{d ∈ c} [ ∏_g ∑_k N(E_dg | μ_gk, σ_gk) θ_ckn ]^{δ_dc} ]
                    = −C log N + ∑_c log ∑_{n=1}^N ∏_{d ∈ c} [ ∏_g ∑_k N(E_dg | μ_gk, σ_gk) θ_ckn ]^{δ_dc}.

References

[1] A.A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, February 2000.
[2] S. Basu, A. Banerjee, and R.J. Mooney. Semi-supervised clustering by seeding. In Proceedings of the International Conference on Machine Learning 2002, pages 27-34, 2002.
[3] S. Basu, M. Bilenko, and R.J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 59-68, New York, NY, USA, 2004. ACM Press.
[4] A. Bhattacharjee et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences, 98, 2001.
[5] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[6] L. Carrivick, S. Rogers, J. Clark, C. Campbell, M. Girolami, and C. Cooper. Identification of prognostic signatures in breast cancer microarray data using Bayesian techniques. Journal of the Royal Society Interface, 3, 2006.
[7] A. Demiriz, K. Bennett, and M. Embrechts. Semi-supervised clustering using genetic algorithms, 1999.
[8] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[9] E. Garber et al. Diversity of gene expression in adenocarcinoma of the lung. Proceedings of the National Academy of Sciences, 98(24), 2001.
[10] Z. Ghahramani and M.J. Beal. Graphical models and variational methods. MIT Press.
[11] M.I. Jordan, Z. Ghahramani, T. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.
[12] S. Rogers, M. Girolami, C. Campbell, and R. Breitling. The latent process decomposition of cDNA microarray datasets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2:143-156, 2005.
[13] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1103-1110, 2000.
[14] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, 2003.
[15] E.-J. Yeoh et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 2002.
More informationSimilarity Measures for Categorical Data A Comparative Study. Technical Report
Similarity Measures for Categorical Data A Comparative Stuy Technical Report Department of Computer Science an Engineering University of Minnesota 4-92 EECS Builing 200 Union Street SE Minneapolis, MN
More informationMulti-View Clustering via Canonical Correlation Analysis
Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in
More informationThe total derivative. Chapter Lagrangian and Eulerian approaches
Chapter 5 The total erivative 51 Lagrangian an Eulerian approaches The representation of a flui through scalar or vector fiels means that each physical quantity uner consieration is escribe as a function
More informationGeneralizing Kronecker Graphs in order to Model Searchable Networks
Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu
More informationThermal conductivity of graded composites: Numerical simulations and an effective medium approximation
JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University
More informationMulti-View Clustering via Canonical Correlation Analysis
Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms
More informationOptimal Signal Detection for False Track Discrimination
Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University
More informationThe Press-Schechter mass function
The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for
More informationTopic 7: Convergence of Random Variables
Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information
More informationLecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012
CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration
More informationConcentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection and System Identification
Concentration of Measure Inequalities for Compressive Toeplitz Matrices with Applications to Detection an System Ientification Borhan M Sananaji, Tyrone L Vincent, an Michael B Wakin Abstract In this paper,
More informationLecture 2 Lagrangian formulation of classical mechanics Mechanics
Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,
More information. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.
S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial
More informationCONTROL CHARTS FOR VARIABLES
UNIT CONTOL CHATS FO VAIABLES Structure.1 Introuction Objectives. Control Chart Technique.3 Control Charts for Variables.4 Control Chart for Mean(-Chart).5 ange Chart (-Chart).6 Stanar Deviation Chart
More informationNon-Linear Bayesian CBRN Source Term Estimation
Non-Linear Bayesian CBRN Source Term Estimation Peter Robins Hazar Assessment, Simulation an Preiction Group Dstl Porton Down, UK. probins@stl.gov.uk Paul Thomas Hazar Assessment, Simulation an Preiction
More informationConvergence of Random Walks
Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of
More informationu!i = a T u = 0. Then S satisfies
Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace
More informationA New Minimum Description Length
A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum
More informationLevel Construction of Decision Trees in a Partition-based Framework for Classification
Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S
More informationLATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION
The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische
More informationFlexible High-Dimensional Classification Machines and Their Asymptotic Properties
Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department
More informationSparse Reconstruction of Systems of Ordinary Differential Equations
Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA
More informationThe Exact Form and General Integrating Factors
7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily
More informationA Hybrid Approach for Modeling High Dimensional Medical Data
A Hybri Approach for Moeling High Dimensional Meical Data Alok Sharma 1, Gofrey C. Onwubolu 1 1 University of the South Pacific, Fii sharma_al@usp.ac.f, onwubolu_g@usp.ac.f Abstract. his work presents
More information7.1 Support Vector Machine
67577 Intro. to Machine Learning Fall semester, 006/7 Lecture 7: Support Vector Machines an Kernel Functions II Lecturer: Amnon Shashua Scribe: Amnon Shashua 7. Support Vector Machine We return now to
More informationSeparation of Variables
Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical
More informationHomework 2 EM, Mixture Models, PCA, Dualitys
Homework 2 EM, Mixture Moels, PCA, Dualitys CMU 10-715: Machine Learning (Fall 2015) http://www.cs.cmu.eu/~bapoczos/classes/ml10715_2015fall/ OUT: Oct 5, 2015 DUE: Oct 19, 2015, 10:20 AM Guielines The
More informationEstimation of the Maximum Domination Value in Multi-Dimensional Data Sets
Proceeings of the 4th East-European Conference on Avances in Databases an Information Systems ADBIS) 200 Estimation of the Maximum Domination Value in Multi-Dimensional Data Sets Eleftherios Tiakas, Apostolos.
More informationMath Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors
Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+
More informationAnalyzing Tensor Power Method Dynamics in Overcomplete Regime
Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical
More informationStructural Risk Minimization over Data-Dependent Hierarchies
Structural Risk Minimization over Data-Depenent Hierarchies John Shawe-Taylor Department of Computer Science Royal Holloway an Befor New College University of Lonon Egham, TW20 0EX, UK jst@cs.rhbnc.ac.uk
More informationMulti-View Clustering via Canonical Correlation Analysis
Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu
More informationarxiv:hep-th/ v1 3 Feb 1993
NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,
More informationChapter 6: Energy-Momentum Tensors
49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.
More informationAcute sets in Euclidean spaces
Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of
More informationSituation awareness of power system based on static voltage security region
The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran
More informationOn the Aloha throughput-fairness tradeoff
On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of
More informationA PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,
More informationALGEBRAIC AND ANALYTIC PROPERTIES OF ARITHMETIC FUNCTIONS
ALGEBRAIC AND ANALYTIC PROPERTIES OF ARITHMETIC FUNCTIONS MARK SCHACHNER Abstract. When consiere as an algebraic space, the set of arithmetic functions equippe with the operations of pointwise aition an
More informationTHE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS
THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS BY ANDREW F. MAGYAR A issertation submitte to the Grauate School New Brunswick Rutgers,
More informationSimulation of Angle Beam Ultrasonic Testing with a Personal Computer
Key Engineering Materials Online: 4-8-5 I: 66-9795, Vols. 7-73, pp 38-33 oi:.48/www.scientific.net/kem.7-73.38 4 rans ech ublications, witzerlan Citation & Copyright (to be inserte by the publisher imulation
More informationImage Denoising Using Spatial Adaptive Thresholding
International Journal of Engineering Technology, Management an Applie Sciences Image Denoising Using Spatial Aaptive Thresholing Raneesh Mishra M. Tech Stuent, Department of Electronics & Communication,
More informationNOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,
NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which
More informationarxiv: v4 [math.pr] 27 Jul 2016
The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,
More information05 The Continuum Limit and the Wave Equation
Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,
More informationLower bounds on Locality Sensitive Hashing
Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,
More informationEstimation of District Level Poor Households in the State of. Uttar Pradesh in India by Combining NSSO Survey and
Int. Statistical Inst.: Proc. 58th Worl Statistical Congress, 2011, Dublin (Session CPS039) p.6567 Estimation of District Level Poor Househols in the State of Uttar Praesh in Inia by Combining NSSO Survey
More informationFLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction
FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number
More informationHomework 2 Solutions EM, Mixture Models, PCA, Dualitys
Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture
More informationDistributed coordination control for multi-robot networks using Lyapunov-like barrier functions
IEEE TRANSACTIONS ON 1 Distribute coorination control for multi-robot networks using Lyapunov-like barrier functions Dimitra Panagou, Dušan M. Stipanović an Petros G. Voulgaris Abstract This paper aresses
More informationLinear First-Order Equations
5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)
More informationTEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE
TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,
More informationNecessary and Sufficient Conditions for Sketched Subspace Clustering
Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This
More informationPredictive Control of a Laboratory Time Delay Process Experiment
Print ISSN:3 6; Online ISSN: 367-5357 DOI:0478/itc-03-0005 Preictive Control of a aboratory ime Delay Process Experiment S Enev Key Wors: Moel preictive control; time elay process; experimental results
More informationinflow outflow Part I. Regular tasks for MAE598/494 Task 1
MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the
More informationWUCHEN LI AND STANLEY OSHER
CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability
More informationMulti-View Clustering via Canonical Correlation Analysis
Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu
More informationMaximal Causes for Non-linear Component Extraction
Journal of Machine Learning Research 9 (2008) 1227-1267 Submitte 5/07; Revise 11/07; Publishe 6/08 Maximal Causes for Non-linear Component Extraction Jörg Lücke Maneesh Sahani Gatsby Computational Neuroscience
More informationFast Resampling Weighted v-statistics
Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ
More informationA variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs
Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics
More informationEntanglement is not very useful for estimating multiple phases
PHYSICAL REVIEW A 70, 032310 (2004) Entanglement is not very useful for estimating multiple phases Manuel A. Ballester* Department of Mathematics, University of Utrecht, Box 80010, 3508 TA Utrecht, The
More informationRelative Entropy and Score Function: New Information Estimation Relationships through Arbitrary Additive Perturbation
Relative Entropy an Score Function: New Information Estimation Relationships through Arbitrary Aitive Perturbation Dongning Guo Department of Electrical Engineering & Computer Science Northwestern University
More informationIN the evolution of the Internet, there have been
1 Tag-Weighte Topic Moel For Large-scale Semi-Structure Documents Shuangyin Li, Jiefei Li, Guan Huang, Ruiyang Tan, an Rong Pan arxiv:1507.08396v1 [cs.cl] 30 Jul 2015 Abstract To ate, there have been massive
More informationAPPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France
APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation
More informationA simple tranformation of copulas
A simple tranformation of copulas V. Durrleman, A. Nikeghbali & T. Roncalli Groupe e Recherche Opérationnelle Créit Lyonnais France July 31, 2000 Abstract We stuy how copulas properties are moifie after
More informationGaussian processes with monotonicity information
Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process
More informationA New Discrete Family of Reduced Modified Weibull Distribution
International Journal of Statistical Distributions an Applications 2017; 3(3): 25-31 http://www.sciencepublishinggroup.com/j/ijsa oi: 10.11648/j.ijs.20170303.11 ISSN: 2472-3487 (Print); ISSN: 2472-3509
More information