Learning Mixtures of Gaussians with Maximum-a-posteriori Oracle


Satyaki Mahalanabis
Dept. of Computer Science, University of Rochester

(Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.)

Abstract

We consider the problem of estimating the parameters of a mixture of distributions, where each component distribution is from a given parametric family (e.g. exponential, Gaussian, etc.). We define a learning model in which the learner has access to a maximum-a-posteriori oracle which, given any sample from a mixture of distributions, tells the learner which component distribution was the most likely to have generated it. We describe a learning algorithm in this setting which accurately estimates the parameters of a mixture of k spherical Gaussians in R^d, assuming the component Gaussians satisfy a mild separation condition. Our algorithm uses only polynomially many (in d, k) samples and oracle calls, and our separation condition is much weaker than those required by unsupervised learning algorithms like [Arora 01, Vempala 02].

1 Introduction

Learning mixtures of distributions, such as mixtures of Gaussians, is one of the most well-studied problems in machine learning (see e.g. [Melnykov 10]). However, most of the known learning algorithms work in an unsupervised setting, and not much is known theoretically about how side information (e.g. in the form of class labels [Basu 02] or pairwise constraints [Shental 03]) can help the learner. In this paper, we define a learning model in which the learner has access to a maximum-a-posteriori oracle, which for any sample from a mixture of distributions tells the learner which component distribution was most likely to have generated it. Our maximum-a-posteriori oracle can be thought of as an expert who can be asked to guess the most likely class label for a sample point. We then demonstrate the advantage provided by this oracle by giving an algorithm (Algorithm 1) which recovers the parameters of a mixture of k d-dimensional spherical Gaussians under a weaker separation assumption, and more efficiently, than known unsupervised algorithms. The number of oracle calls made by our algorithm is polynomial in d, k and is independent of the desired error.

The problem of learning mixtures of Gaussians has a long history, and various learning models have been considered in the literature. There is the clustering framework [Dasgupta 99, Arora 01], in which the goal is to correctly label each sample generated by the mixture with the Gaussian that generated it. Then there is the distribution learning model [Feldman 06], in which the goal is to output a hypothesis mixture which is as close to the unknown mixture as possible. Probably the most popular model, however, is one in which the parameters of the original mixture need to be recovered (e.g. [Dempster 77, Belkin 10]), and this is the model we use; EM [Dempster 77] is perhaps the best known algorithm in this model. Note that for the clustering problem it is necessary to have a separation assumption [Arora 01], unlike for the problem of recovering parameters. Note also that our model is more difficult than learning the distribution. Almost all algorithms for clustering or parameter learning which give provable guarantees also require that each Gaussian in the mixture be separated from the others [Vempala 02, Arora 01]. For the clustering framework, [Arora 01] give a lower bound on the separation for the case of spherical Gaussians. We also need to make a separation assumption (see (7)), which is much weaker than those required by e.g. [Arora 01, Vempala 02]. However, we have access to a maximum-a-posteriori oracle, whereas the settings in [Arora 01, Vempala 02, Dasgupta 99, Belkin 10, Moitra 10] are unsupervised. Recently [Belkin 10, Moitra 10] have given algorithms for learning parameters which require no separation assumption. However, the absence of a separation assumption means that their algorithms, unlike ours or those of [Arora 01, Vempala 02], may require exponentially many samples in

the number of mixture components. Our algorithm is also, not surprisingly, simpler than theirs. We also note that our results are for spherical Gaussians only.

The algorithm in [Dasgupta 99] works only when all Gaussians have the same covariance matrix; this was extended to Gaussians having arbitrary covariance matrices by [Arora 01]. The results in [Vempala 02], like ours, are for spherical Gaussians. They were the first to use spectral methods, and their technique has since been extended to arbitrary log-concave distributions [Kannan 08, Achlioptas 05]. The algorithms in [Belkin 10, Moitra 10] can learn mixtures of arbitrary Gaussians.

Even though semi-supervised clustering has been a popular area of research (e.g. [Basu 02, Xing 02, Basu 04]), not much is known theoretically about semi-supervised or active learning of mixtures of Gaussians. A variant of EM using pairwise constraints is given by [Shental 03]. They use two types of constraints: pairs of points which were generated by the same Gaussian, and pairs which were not. These types of pairwise relations seem to be the most popular model of semi-supervised learning of mixture models [Melnykov 10]. In our model we consider a maximum-a-posteriori oracle, and as we point out in Section 2, learning with this oracle can be more challenging than with pairwise constraints. Our setting is also different from semi-supervised learning because the learner is allowed to choose the points for which it wants labels. However, the algorithm we give for mixtures of Gaussians queries the label of every sample it draws initially, and later on uses only unlabeled samples.

Finally, while our aim in this paper is to theoretically demonstrate the advantage of the maximum-a-posteriori oracle, the oracle models an expert in a natural way and should find practical applications. The various UCI datasets discussed in the Experimental Results section of [Shental 03] are examples where our oracle learning model could be applicable. Our Algorithm 1 can be used whenever a dataset can be modeled reasonably accurately using a mixture of Gaussians and side information is available in the form of which clusters selected data points belong to.

The paper is organized as follows. In Section 2 we first define our learning model (Subsection 2.1), and then state our result about learning mixtures of Gaussians in this model (Subsection 2.2). In Section 3 we justify the separation assumption that we require, and then present our algorithm and analyze it. Finally, in Section 4 we discuss how our model and our algorithm could be extended.

2 Summary of Contributions

We first define the maximum-a-posteriori oracle and then give our result for mixtures of Gaussians.

2.1 Learning with a Maximum-a-posteriori Oracle

We first define the learning model. Consider a mixture density in R^d defined as μ(x) = Σ_{i=1}^k w_i μ_i(x), where Σ_i w_i = 1 and w_min = min_i w_i > 0. The densities {μ_i}_{i=1}^k belong to a parametric family (e.g. uniform, Gaussian, etc.) and have parameters {θ_i}_{i=1}^k respectively. Given an unknown mixture density, the goal of a learner is to output estimates {(θ̂_i, ŵ_i)}_{i=1}^k of {(θ_i, w_i)}_{i=1}^k. The learner can draw independent (unlabeled) samples from μ by invoking an oracle called SAMP. For any sample x returned by SAMP, the learner can also choose to invoke an oracle, MAP, which labels x with the most likely density that could have generated x, i.e.

    MAP(x) = argmax_i p_i(x),  where  p_i(x) = w_i μ_i(x) / Σ_j w_j μ_j(x)    (1)

is the posterior distribution on {1, ..., k}. For the case of a mixture of Gaussians (with no two Gaussians being concentric), note that the set of points where two or more densities are equally likely has measure 0, and hence for a random sample from μ there is almost surely a unique most likely (a posteriori) density, as defined by (1). We point out that in our model the number of mixture components k is known to the learner; typically k ≪ d.
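To make the two oracles concrete, the following sketch (ours, not part of the paper; all function and variable names are illustrative) shows how SAMP and MAP would behave for a mixture of spherical Gaussians: MAP simply returns the index maximizing the weighted component log-density in (1).

import numpy as np

def map_oracle(x, weights, means, sigmas):
    """Return argmax_i of w_i * N(x; u_i, sigma_i^2 I), computed via log-densities.

    weights: (k,) mixing weights; means: (k, d) component means;
    sigmas: (k,) standard deviations of the spherical components.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    sq_dist = np.sum((means - x) ** 2, axis=1)
    # log of w_i * (2*pi*sigma_i^2)^(-d/2) * exp(-||x - u_i||^2 / (2*sigma_i^2))
    log_post = (np.log(weights)
                - 0.5 * d * np.log(2 * np.pi * sigmas ** 2)
                - sq_dist / (2 * sigmas ** 2))
    return int(np.argmax(log_post))

def samp_oracle(rng, weights, means, sigmas):
    """Draw one unlabeled sample from the mixture (the SAMP oracle)."""
    i = rng.choice(len(weights), p=weights)
    return means[i] + sigmas[i] * rng.standard_normal(means.shape[1])

With two well-separated components, map_oracle almost always returns the index of the component that actually generated the point, which is exactly the behavior the model relies on.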
We next compare our model with other similar settings. We point out that learning using the MAP oracle is more difficult than in a semi-supervised setting where each point in a random subset of samples is labeled according to which component actually generated it (i.e. the label is drawn randomly from the posterior p_i). To see this, consider a mixture of two or more identical, almost concentric Gaussians in which one has a much greater weight than the others. If points are labeled according to the posterior, enough samples from each Gaussian can be obtained to estimate the parameters with arbitrary accuracy, whereas the MAP oracle would only ever return the label of the heaviest Gaussian, and hence MAP does not help recover the parameters of the other Gaussians. Learning with the MAP oracle is more challenging than with pairwise constraints [Shental 03] as well, because with sufficiently many constraints one can discover the actual label of every sample and recover all the parameters with arbitrary accuracy. Our model is also different from that of [Castelli 96], where it is assumed that the parameters of each component are known (only the mixing weights are unknown) and the learner just needs to be able to approximate the posterior.

The example of a mixture of two or more identical, almost concentric Gaussians given above shows that, in order for the MAP oracle to help the learner, the component densities need to be sufficiently separated. Such separation assumptions are often required for unsupervised learning (e.g. see

Table 1 for the case of mixtures of Gaussians), though the separation we require should be smaller because we have help from MAP. Finally, note that any learning algorithm requires Ω(1/w_min) unlabeled samples so that there are sufficiently many samples from each component from which to estimate its parameters.

2.2 Mixture of Gaussians

Let N_{u,Σ} denote a Gaussian with mean u and covariance matrix Σ. The spherical Gaussian in R^d has density

    N_{u,σ²I}(x) = (2πσ²)^{-d/2} e^{-‖x-u‖²/(2σ²)},

where I is the identity matrix and ‖·‖ denotes the Euclidean norm.

Table 1: Sample complexity and separation for learning mixtures of Gaussians (all algorithms except ours are unsupervised).

                 Separation                                Complexity
[Dasgupta 99]    Ω(d^{1/2}) max{σ_i, σ_j}                  Õ(d poly(k))
[Dasgupta 07]    Ω(d^{1/4}) max{σ_i, σ_j}                  Õ(d k²)
[Arora 01]       Ω(d^{1/4} ln^{1/4} d) max{σ_i, σ_j}       Õ(d² poly(k))
[Vempala 02]     Ω(k^{1/4} ln^{1/4}(dk)) max{σ_i, σ_j}     Õ(d³ k²)
[Belkin 10]      none                                      poly(d) e^{poly(k)}
[Moitra 10]      none                                      poly(d) e^{poly(k)}
Ours             Ω(√(ln(dk))) max{σ_i, σ_j}                Õ(k)

(Table 1 does not explicitly show the dependence on the parameter w_min and instead assumes, for the sake of easy comparison, that w_min ≥ 1/k^c for some constant c > 0. The separation in [Arora 01] is more complicated and can allow for concentric Gaussians if σ_i, σ_j are different.)

We will demonstrate the advantage provided by the MAP oracle by giving an algorithm (Algorithm 1) which learns a mixture of Gaussians μ(x) = Σ_{i=1}^k w_i N_{u_i,σ_i²I}(x) at a smaller separation than previous unsupervised algorithms like [Dasgupta 99, Arora 01] (see Table 1). Our main result and separation condition are as follows.

Theorem 1 (Restated from Theorem 11, Section 3). There is an algorithm which, for any ε, 0 < ε < 1/20, any 0 < δ < 1, and any mixture of k Gaussians μ = Σ_i w_i N_{u_i,σ_i²I} satisfying the following separation condition

    ∀ i ≠ j:  ‖u_i − u_j‖ ≥ 30 √( ln( 44dk w_j / (εδ w_min w_i) ) + 3 ) (σ_i + σ_j),    (2)

draws at most (10⁶ + 2000/ε²) ln(4(k+1)/δ) (unlabeled) samples, makes 10⁶ ln(4(k+1)/δ) calls to MAP, and returns {(û_i, σ̂_i, ŵ_i)}_{i=1}^k such that, with probability at least 1 − δ, for all i,

    ‖u_i − û_i‖ ≤ ε σ_i,  |σ̂_i − σ_i| ≤ ε σ_i  and  |ŵ_i − w_i| ≤ ε w_i.    (3)

Table 1 compares our separation assumption (2) with those required by previous unsupervised algorithms. We want to be able to learn the mixture parameters for as small a separation as possible, ideally at most polylogarithmic in d, k, as in (2). Our required separation is smaller than that required by e.g. [Vempala 02] (O(max{σ_i, σ_j} k^{1/4} polylog(d, k))). Note that under separation (2), the overlap between the Gaussians is such that unsupervised distance-based clustering algorithms (like [Arora 01, Vempala 02]) may not work. We point out that for spherical Gaussians [Arora 01] show that a separation of Ω(d^{1/4}) max{σ_i, σ_j} is necessary in order to tell, with high probability, which Gaussian actually generated each sample from μ. Recently [Belkin 10, Moitra 10] have given algorithms for mixtures of Gaussians which do not require any separation assumption. However, [Moitra 10] prove that any (unsupervised) learning algorithm for mixtures of Gaussians with no separation assumption will require exponentially many (in k) samples. In contrast, our algorithm (provably) requires only polynomially many (in k) samples under a (weak) separation assumption, and is much simpler than theirs.

2.3 Detailed Results

See Theorem 11 and separation (25) for a more precise statement. Theorem 1 essentially says that the number of calls made by our algorithm to the MAP oracle in order to achieve an error of ε is independent of ε, i.e. only O(ln(k/δ)), provided the separation between the Gaussians is only polylogarithmic in d, k, 1/ε. The accuracy (3) achieved by our algorithm implies that each estimated component is statistically close to the true component: for each i, ‖N_{u_i,σ_i²I} − N_{û_i,σ̂_i²I}‖_1 = O(ε). This means the hypothesis mixture our algorithm outputs is within O(ε) (with respect to L1, or total variation, distance) of the true mixture distribution. (We will use N_{u,σ²} to denote both the density and the probability measure.)
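As a concrete illustration (ours, not the paper's; the function name is hypothetical and the constants simply mirror condition (2) as written above), one can check whether a given spherical mixture satisfies a pairwise separation of this form:

import numpy as np

def satisfies_separation(means, sigmas, weights, eps, delta):
    """Check the pairwise separation condition of Theorem 1, i.e. for all i != j:
    ||u_i - u_j|| >= 30*sqrt(ln(44*d*k*w_j/(eps*delta*w_min*w_i)) + 3) * (sigma_i + sigma_j).
    """
    k, d = means.shape
    w_min = weights.min()
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            lhs = np.linalg.norm(means[i] - means[j])
            rhs = 30.0 * np.sqrt(
                np.log(44.0 * d * k * weights[j] / (eps * delta * w_min * weights[i])) + 3.0
            ) * (sigmas[i] + sigmas[j])
            if lhs < rhs:
                return False
    return True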
2.4 Notations and Auxiliary Lemmas

We will use e_1, e_2, ..., e_d to denote the usual orthonormal basis of R^d. Given vectors u, v ∈ R^d, we will use u·v for their inner product. Φ will denote the cdf of the standard normal distribution N(0, 1).

Lemma 2 (Chernoff Bound, see e.g. [Dubhashi 09]). Let X = Σ_{i=1}^n X_i, where {X_i}_{i=1}^n are independently distributed in [0, 1]. Then for any 1 > ε > 0,

    Pr[ X ≤ (1 − ε) E[X]  or  X ≥ (1 + ε) E[X] ] ≤ 2 e^{−ε² E[X]/3}.    (4)

Fact 3. The KL-divergence between two Gaussians N_{u_1,σ_1²I}, N_{u_2,σ_2²I} is given by

    KL( N_{u_1,σ_1²I} ‖ N_{u_2,σ_2²I} ) = ( ‖u_1 − u_2‖² + d(σ_1² − σ_2²) ) / (2σ_2²) + d ln(σ_2/σ_1).

We will use the following property of Gaussians in proving Lemma 10.

Fact 4. Let ρ = Φ^{-1}(3/4). For any r ≥ 0, Δ > 0 and x ∈ R,

    | N_{u,σ²}((−∞, x]) − 1/2 | ≤ r  ⟹  |x − u| ≤ 5rσ,  and    (5)
    | N_{u,σ²}([u − ρσ, u + ρσ]) − N_{u,σ²}([u − Δ, u + Δ]) | ≤ r  ⟹  |Δ − ρσ| ≤ 2rσ.    (6)

Definition 5. We will say that a mixture of k Gaussians Σ_{i=1}^k w_i N_{u_i,σ_i²I} in R^d is β-separated, where β > 0, if the following pairwise separation condition holds:

    ∀ i ≠ j:  ‖u_i − u_j‖ ≥ 30 √( ln( w_j / (w_i β) ) + 3 ) (σ_i + σ_j).    (7)

Our required separation condition (2) implies that the mixture to be learned is β-separated for β = εδw_min/(44dk). We also need to define, for all i, j with i ≠ j,

    A_ij = { x : w_i N_{u_i,σ_i²I}(x) ≥ w_j N_{u_j,σ_j²I}(x) }  and  A_i = ∩_{j≠i} A_ij,    (8)

i.e. A_ij is the set where Gaussian i is more likely than j, and A_i is the set where i is the most likely Gaussian. Note that A_i is precisely the set of points for which MAP will return label i.

Consider two Gaussians which are far apart, and a third Gaussian which is much closer to the first Gaussian than to the second. The following lemma shows that, with high probability, the first Gaussian is more likely than the second one at points drawn from the third Gaussian.

Lemma 6. Consider two Gaussians N_{u_1,σ_1²I}, N_{u_2,σ_2²I}. Let N_{ũ,σ̃²I} be another Gaussian such that

    ‖u_1 − ũ‖ ≤ (1/10) min{ σ̃, ‖u_1 − u_2‖ }  and  |σ̃ − σ_1| ≤ (1/10) min{ σ̃, σ_1 }.    (9)

Then for any t > 0,

    N_{ũ,σ̃²I}( { x : N_{u_1,σ_1²I}(x) ≤ t N_{u_2,σ_2²I}(x) } ) ≤ t e^{ −‖u_1−u_2‖² / (80(σ_1² + σ_2²)) + 3 }.    (10)

Proof (of Lemma 6). Assume w.l.o.g. that ũ = 0. By Markov's inequality, for X ~ N_{ũ,σ̃²I},

    N_{ũ,σ̃²I}( { x : N_{u_1,σ_1²I}(x) ≤ t N_{u_2,σ_2²I}(x) } )
        = Pr[ N_{u_2,σ_2²I}(X) / N_{u_1,σ_1²I}(X) ≥ 1/t ]
        ≤ t E[ N_{u_2,σ_2²I}(X) / N_{u_1,σ_1²I}(X) ].

Evaluating this Gaussian integral by completing the square, and using assumption (9) to bound the resulting terms, gives an upper bound of t e^{ −‖u_1−u_2‖²/(80(σ_1²+σ_2²)) + 3 }, which is (10).

Next, using Lemma 6, we show that in a β-separated mixture, with high probability, component i is more likely a posteriori than any other component j ≠ i at a random point generated by component i itself.

Lemma 7. For any β > 0 and any β-separated mixture of k Gaussians Σ_i w_i N_{u_i,σ_i²I}, for all i ≠ j,

    N_{u_i,σ_i²I}(A_ij) > 1 − β.    (11)

Proof (of Lemma 7). Applying Lemma 6 to Gaussians i, j (with t = w_j/w_i), we have

    N_{u_i,σ_i²I}(A_ij) ≥ 1 − (w_j/w_i) e^{ −‖u_j−u_i‖² / (80(σ_i² + σ_j²)) + 3 }.

Note that condition (9) required for Lemma 6 is satisfied trivially, since in this case N_{u_1,σ_1²I} and N_{ũ,σ̃²I} in Lemma 6 are the same. Lemma 7 now follows by substituting separation (7) in the bound above.
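To make Lemma 7 concrete, here is a small simulation sketch (ours, not part of the paper; all names are illustrative) that estimates N_{u_i,σ_i²I}(A_i) for a given mixture by sampling from component i and checking whether its MAP label is i:

import numpy as np

def estimate_map_accuracy(weights, means, sigmas, i, n_samples=10000, seed=0):
    """Monte-Carlo estimate of N_{u_i, sigma_i^2 I}(A_i): the probability that a
    point drawn from component i is MAP-labeled i (cf. Lemma 7)."""
    rng = np.random.default_rng(seed)
    k, d = means.shape
    x = means[i] + sigmas[i] * rng.standard_normal((n_samples, d))
    # log of w_l * N(x; u_l, sigma_l^2 I) for every component l
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)          # (n, k)
    log_post = (np.log(weights)[None, :]
                - 0.5 * d * np.log(2 * np.pi * sigmas ** 2)[None, :]
                - sq_dist / (2 * sigmas ** 2)[None, :])
    return float(np.mean(np.argmax(log_post, axis=1) == i))

Under a separation of the form (7) this fraction should exceed 1 − β for every i, whereas for nearly concentric components with very unequal weights it collapses toward 0 for the lighter components, which is the failure mode discussed in Section 2.1.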

Remark 8. The separation in (7) is the minimum required (up to a constant factor) such that (11) holds, i.e. such that the i-th Gaussian is more likely than the others at most points generated by itself. One can show this by using the following inequality, due to Bretagnolle and Huber (see e.g. [Devroye 01], Chapter 5, Exercise 5.6), for densities f_1, f_2:

    ∫ min{ f_1, f_2 } ≥ (1/2) e^{ −KL(f_1 ‖ f_2) }.

Applying this to N_{u_i,σ_i²I}, N_{u_j,σ_j²I} for the case w_i = w_j yields, using Fact 3,

    1 − N_{u_i,σ_i²I}(A_ij) ≥ (1/2) e^{ −KL( N_{u_i,σ_i²I} ‖ N_{u_j,σ_j²I} ) } ≥ (1/2) e^{ −‖u_i−u_j‖² / max{σ_i, σ_j}² },

which shows that separation (7) is tight.

3 Learning Algorithm and Analysis

We first state a subroutine, EstimateParameters, which takes a set of labeled samples as input and uses a sample-median-based estimator for computing the parameters of each Gaussian component. This subroutine is used by the main Algorithm 1.

The following lemma will be used to show that in step 4 of Algorithm 1, when the estimated parameters {(ū_i, σ̄_i, w̄_i)}_{i=1}^k are sufficiently close to the true parameters {(u_i, σ_i, w_i)}_{i=1}^k, component N_{ū_i,σ̄_i²I} is more likely than any other N_{ū_j,σ̄_j²I} at most points generated by the i-th true component N_{u_i,σ_i²I}. This means that once our estimates have converged close to the true parameters, we need not make any more calls to the MAP oracle and can instead approximate MAP using {(ū_i, σ̄_i, w̄_i)}_{i=1}^k.

Lemma 9. Let β > 0. Let Σ_i w_i N_{u_i,σ_i²I} be a β-separated mixture of k Gaussians, and let Σ_i w̄_i N_{ū_i,σ̄_i²I} be another mixture of k Gaussians such that, for all i,

    ‖ū_i − u_i‖ ≤ σ_i/8,  |σ̄_i − σ_i| ≤ σ_i/10  and  |w̄_i − w_i| ≤ w_i/10.    (12)

Then for all i ≠ j,

    N_{u_i,σ_i²I}( { x : w̄_i N_{ū_i,σ̄_i²I}(x) ≥ w̄_j N_{ū_j,σ̄_j²I}(x) } ) ≥ 1 − β.    (13)

Proof (of Lemma 9). Condition (12) and β-separation (7) imply that condition (9) required for Lemma 6 is satisfied for the Gaussians N_{ū_i,σ̄_i²I}, N_{ū_j,σ̄_j²I} and N_{u_i,σ_i²I}. Hence by Lemma 6,

    N_{u_i,σ_i²I}( { w̄_i N_{ū_i,σ̄_i²I} < w̄_j N_{ū_j,σ̄_j²I} } )
        ≤ (w̄_j/w̄_i) e^{ −‖ū_i−ū_j‖² / (80(σ̄_i² + σ̄_j²)) + 3 }
        ≤ 2 (w_j/w_i) e^{ −‖u_i−u_j‖² / (160(σ_i² + σ_j²)) + 4 },    (14)

where we have used the fact that, under separation (7) and condition (12), ‖ū_i − ū_j‖ ≥ (4/5)‖u_i − u_j‖, (9/10)w_i ≤ w̄_i ≤ (11/10)w_i, (9/10)w_j ≤ w̄_j ≤ (11/10)w_j and (9/10)σ_i ≤ σ̄_i ≤ (11/10)σ_i, (9/10)σ_j ≤ σ̄_j ≤ (11/10)σ_j. Substituting (7) into (14) gives the lemma.

Subroutine 1: EstimateParameters
Input: Set S of labeled samples
Output: { (ũ_i, σ̃_i, w̃_i) }_{i=1}^k
begin
    for i = 1 ... k do
        S_i ← { x : (x, i) ∈ S };
        ũ_i ← ( median_{x∈S_i} x·e_1, median_{x∈S_i} x·e_2, ..., median_{x∈S_i} x·e_d );
        σ̃_i ← (1/ρ) · median_{x∈S_i} | x·e_1 − ũ_i·e_1 |,  where ρ = Φ^{-1}(3/4);
        w̃_i ← |S_i| / ( |S_1| + |S_2| + ... + |S_k| );
    end
end

For estimating σ̃_i in Subroutine 1 we could have projected the points in S_i along any e_l instead of e_1. Intuitively, if the points in S_i were drawn from the exact distribution N_{u_i,σ_i²I}, then ũ_i = u_i and σ̃_i = σ_i in expectation. In our case, however, because of the overlap between the Gaussians there will be some samples in S_i which were generated by other components j ≠ i. We use the sample median to filter out these outliers.

A naïve learning algorithm would be to draw O(ln(k/δ)/ε²) unlabeled samples and then invoke MAP on each of them; the points labeled i could then be used to compute the parameters of the i-th Gaussian. However, we show below that Algorithm 1 actually makes far fewer calls to MAP by requesting labels only initially. The number of labels required thus decreases to O(ln(k/δ)), i.e. independent of ε.

Algorithm 1 is based on Lemma 7 and Lemma 9; the parameters m₁, m₂ are set as stated in Theorem 11. Algorithm 1 works in two phases. In the first phase (steps 1, 2) it uses the MAP oracle to label each random sample; the number of samples m₁ is sufficiently large for the estimated centres {ū_i}_{i=1}^k to be within constant distance O(1) of the respective true centres {u_i}_{i=1}^k. In the second phase (steps 3, 4, 5) Algorithm 1 does not need to use MAP at all. Instead it uses the estimates {(ū_i, σ̄_i, w̄_i)}_{i=1}^k from the first phase to approximate the MAP oracle (see Lemma 9), and the final estimated centres {û_i}_{i=1}^k are guaranteed to be within distance εσ_i of the respective true ones. We first analyze subroutine EstimateParameters before analyzing Algorithm 1.
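The median-based estimator is straightforward to express in code. The sketch below (ours, not the paper's; names are illustrative) mirrors Subroutine 1: coordinate-wise medians for the centre, a median absolute deviation along e_1 rescaled by Φ^{-1}(3/4) for the spread, and empirical label frequencies for the weights.

import numpy as np
from scipy.stats import norm

RHO = norm.ppf(0.75)  # Phi^{-1}(3/4), the rescaling constant in Subroutine 1

def estimate_parameters(samples, labels, k):
    """Median-based parameter estimates from labeled samples (cf. Subroutine 1).

    samples: (n, d) array; labels: (n,) array with values in {0, ..., k-1}.
    Assumes every label occurs at least once. Returns (means, sigmas, weights).
    """
    n, d = samples.shape
    means = np.zeros((k, d))
    sigmas = np.zeros(k)
    weights = np.zeros(k)
    for i in range(k):
        S_i = samples[labels == i]
        means[i] = np.median(S_i, axis=0)                               # coordinate-wise medians
        sigmas[i] = np.median(np.abs(S_i[:, 0] - means[i, 0])) / RHO    # MAD along e_1, rescaled
        weights[i] = len(S_i) / n
    return means, sigmas, weights

The median (rather than the mean) is what makes the estimates robust to the small fraction of points in S_i that were in fact generated by other components.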

Algorithm 1: Learning Algorithm for Mixture of Gaussians
Input: Sample sizes m₁, m₂
Output: { (û_i, σ̂_i, ŵ_i) }_{i=1}^k
begin
    S ← ∅; S′ ← ∅;
    1:  for i = 1 ... m₁ do
            x ← SAMP(); l ← MAP(x); S ← S ∪ {(x, l)};
        end
    2:  { (ū_i, σ̄_i, w̄_i) }_{i=1}^k ← EstimateParameters(S);
    3:  for i = 1 ... m₂ do
            x ← SAMP();
    4:      l′ ← argmax_l w̄_l N_{ū_l,σ̄_l²I}(x); S′ ← S′ ∪ {(x, l′)};
        end
    5:  { (û_i, σ̂_i, ŵ_i) }_{i=1}^k ← EstimateParameters(S′);
end
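A compact end-to-end sketch of the two-phase control flow (again ours, with hypothetical names; it assumes map_oracle, samp_oracle and estimate_parameters as sketched earlier and is only meant to illustrate the structure of Algorithm 1):

import numpy as np

def learn_mixture(samp, map_oracle, estimate_parameters, k, d, m1, m2):
    """Two-phase learner in the spirit of Algorithm 1.

    Phase 1: label m1 fresh samples with the MAP oracle, run the median estimator.
    Phase 2: label m2 fresh samples with the *plug-in* MAP rule built from the
    phase-1 estimates (no further oracle calls), then re-run the median estimator.
    Assumes every component receives at least one label in each phase.
    """
    # Phase 1: oracle-labeled samples.
    xs1 = np.array([samp() for _ in range(m1)])
    ls1 = np.array([map_oracle(x) for x in xs1])
    means1, sigmas1, weights1 = estimate_parameters(xs1, ls1, k)

    # Phase 2: plug-in MAP labels computed from the coarse phase-1 estimates.
    def plugin_map(x):
        sq = np.sum((means1 - x) ** 2, axis=1)
        score = (np.log(weights1) - 0.5 * d * np.log(2 * np.pi * sigmas1 ** 2)
                 - sq / (2 * sigmas1 ** 2))
        return int(np.argmax(score))

    xs2 = np.array([samp() for _ in range(m2)])
    ls2 = np.array([plugin_map(x) for x in xs2])
    return estimate_parameters(xs2, ls2, k)

The point of the split is that m₁, the only quantity that costs oracle calls, does not need to grow as the target accuracy ε shrinks; only m₂ does.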

Lemma 10. Let 1 > δ′ > 0 and 1/10 > ε′ > 0, 1/10 > β′ > 0. Let the input S to EstimateParameters be such that the points { x : (x, i) ∈ S } are independent samples from the mixture of k Gaussians Σ_i w_i N_{u_i,σ_i²I}. Assume |S| ≥ (16/(ε′² w_min)) ln(2d(k+1)/δ′). Further, in EstimateParameters assume that, for each i = 1 ... k, the set S_i (i.e. the points labeled i) consists of independent samples from a density μ′_i, where

    ‖ μ′_i − N_{u_i,σ_i²I} ‖_1 ≤ β′,    (15)

and that

    w_i (1 − β′) ≤ E[ |S_i| / |S| ] ≤ w_i (1 + β′).    (16)

Then with probability at least 1 − δ′, for all i,

    ‖ũ_i − u_i‖ ≤ 5(ε′ + β′) σ_i,  |σ̃_i − σ_i| ≤ 9(ε′ + β′) σ_i  and  |w̃_i − w_i| ≤ 2(ε′ + β′) w_i.

Proof (of Lemma 10). From the Chernoff bound (4) and assumption (16), it follows that, given |S| ≥ (16/(ε′² w_min)) ln(2d(k+1)/δ′), with probability at least 1 − δ′/(2(k+1)), for each i = 1 ... k, |S_i| ≥ (16/ε′²) ln(2d(k+1)/δ′) and |w̃_i − w_i| ≤ 2(ε′ + β′) w_i.

Consider any e_l. For each i, it follows from condition (15) that

    | Pr_{X∼μ′_i}[ X·e_l ≤ ũ_i·e_l ] − N_{u_i·e_l, σ_i²}( (−∞, ũ_i·e_l] ) | ≤ β′.    (17)

Let F_{i,l} denote the empirical distribution of the points in S_i projected onto e_l, i.e. F_{i,l}(A) = |{ x ∈ S_i : x·e_l ∈ A }| / |S_i|, and note that ũ_i·e_l is the median of F_{i,l}. If |S_i| ≥ (16/ε′²) ln(2d(k+1)/δ′), it follows from the Chernoff bound (4) that, with probability at least 1 − δ′/(2d(k+1)),

    | F_{i,l}( (−∞, ũ_i·e_l] ) − Pr_{X∼μ′_i}[ X·e_l ≤ ũ_i·e_l ] | ≤ ε′.    (18)

Combining (17), (18) and the fact that ũ_i·e_l is the median of F_{i,l}, we have

    | N_{u_i·e_l, σ_i²}( (−∞, ũ_i·e_l] ) − 1/2 | ≤ β′ + ε′.    (19)

Now (19) and bound (5) in Fact 4 together imply that |ũ_i·e_l − u_i·e_l| ≤ 5(β′ + ε′) σ_i. Hence, applying the union bound over all directions l = 1, ..., d, we have that with probability at least 1 − δ′/(2(k+1)), ‖ũ_i − u_i‖ ≤ 5(β′ + ε′) σ_i.

For σ̃_i a similar argument applies. Recall ρ = Φ^{-1}(3/4). From condition (15), it follows that

    | Pr_{X∼μ′_i}[ |X·e_1 − ũ_i·e_1| ≤ ρσ̃_i ] − N_{u_i·e_1, σ_i²}( [ũ_i·e_1 − ρσ̃_i, ũ_i·e_1 + ρσ̃_i] ) | ≤ β′.    (20)

Note that F_{i,1}( [ũ_i·e_1 − ρσ̃_i, ũ_i·e_1 + ρσ̃_i] ) = 1/2 by the definition of σ̃_i. If |S_i| ≥ (16/ε′²) ln(2d(k+1)/δ′), then by the Chernoff bound (4), with probability at least 1 − δ′/(2d(k+1)),

    | F_{i,1}( [ũ_i·e_1 − ρσ̃_i, ũ_i·e_1 + ρσ̃_i] ) − Pr_{X∼μ′_i}[ |X·e_1 − ũ_i·e_1| ≤ ρσ̃_i ] | ≤ ε′.    (21)

Since N_{u_i·e_1, σ_i²}( [u_i·e_1 − ρσ_i, u_i·e_1 + ρσ_i] ) = 1/2, combining (20), (21) we get

    | N_{u_i·e_1, σ_i²}( [u_i·e_1 − ρσ_i, u_i·e_1 + ρσ_i] ) − N_{u_i·e_1, σ_i²}( [ũ_i·e_1 − ρσ̃_i, ũ_i·e_1 + ρσ̃_i] ) | ≤ β′ + ε′.    (22)

Now it follows from (19) that

    | N_{u_i·e_1, σ_i²}( [ũ_i·e_1 − ρσ̃_i, ũ_i·e_1 + ρσ̃_i] ) − N_{u_i·e_1, σ_i²}( [u_i·e_1 − ρσ̃_i, u_i·e_1 + ρσ̃_i] ) | ≤ 2(β′ + ε′).    (23)

Combining (22), (23) we have, by the triangle inequality,

    | N_{u_i·e_1, σ_i²}( [u_i·e_1 − ρσ_i, u_i·e_1 + ρσ_i] ) − N_{u_i·e_1, σ_i²}( [u_i·e_1 − ρσ̃_i, u_i·e_1 + ρσ̃_i] ) | ≤ 3(β′ + ε′).    (24)

It follows from (24) and bound (6) in Fact 4 that |σ̃_i − σ_i| ≤ (6/ρ)(β′ + ε′) σ_i < 9(β′ + ε′) σ_i, with probability at least 1 − δ′/(2(k+1)). Hence, if |S_i| ≥ (16/ε′²) ln(2d(k+1)/δ′), then with probability at least 1 − δ′/(2k), ‖ũ_i − u_i‖ ≤ 5(ε′ + β′) σ_i, |σ̃_i − σ_i| ≤ 9(ε′ + β′) σ_i and |w̃_i − w_i| ≤ 2(ε′ + β′) w_i. The lemma now follows by applying the union bound over all i = 1 ... k.

We finally state our main theorem.

Theorem 11. Let 0 < ε < 1/20 and 0 < δ < 1. Then for any mixture of k Gaussians μ = Σ_i w_i N_{u_i,σ_i²I} satisfying the following separation condition,

    ∀ i ≠ j:  ‖u_i − u_j‖ ≥ 30 √( ln( 44dk w_j / (εδ w_min w_i) ) + 3 ) (σ_i + σ_j),    (25)

Algorithm 1 with m₁ = 10⁶ ln(4(k+1)/δ) and m₂ = (2000/ε²) ln(4(k+1)/δ) returns {(û_i, σ̂_i, ŵ_i)}_{i=1}^k such that, with probability at least 1 − δ, for all i,

    ‖u_i − û_i‖ ≤ ε σ_i,  |σ̂_i − σ_i| ≤ ε σ_i  and  |ŵ_i − w_i| ≤ ε w_i.    (26)

Proof (of Theorem 11). Consider running Algorithm 1 with parameters m₁ = 10⁶ ln(4(k+1)/δ) and m₂ = (2000/ε²) ln(4(k+1)/δ). Note that Algorithm 1 makes only m₁ calls to MAP.

We first analyze steps 1, 2. Under separation (25), our mixture μ is β₁-separated, where β₁ = εδw_min/(44dk). Therefore by Lemma 7 we have

    ∀ i ≠ j:  N_{u_i,σ_i²I}(A_ij) ≥ 1 − β₁,    (27)

which implies, using the union bound, that for all i and all j ≠ i,

    N_{u_j,σ_j²I}(A_i) ≤ β₁  and  N_{u_i,σ_i²I}(A_i) ≥ 1 − kβ₁.    (28)

From (28) it follows that

    w_i (1 − β₂) ≤ Σ_j w_j N_{u_j,σ_j²I}(A_i) ≤ w_i (1 + β₂),  where β₂ := kβ₁/w_min ≤ ε/44.    (29)

Now note that, for each i, the set of points S_i which are labeled i in step 1 is drawn independently from μ′_i, where

    μ′_i(x) = ( Σ_j w_j N_{u_j,σ_j²I}(x) ) / ( Σ_j w_j N_{u_j,σ_j²I}(A_i) )  if x ∈ A_i,  and 0 otherwise.    (30)

Using (27), (28) and (29) we have that, for each i,

    ‖ μ′_i − N_{u_i,σ_i²I} ‖_1 ≤ N_{u_i,σ_i²I}( R^d \ A_i ) + ∫_{A_i} | μ′_i − N_{u_i,σ_i²I} | ≤ β₂ + β₂/(1 − β₂) + 2β₂ < 4β₂.    (31)

Now |S| = m₁ = 10⁶ ln(4(k+1)/δ) and, by (29), w_i(1 − β₂) ≤ E[ |S_i|/|S| ] ≤ w_i(1 + β₂). Hence we can apply Lemma 10 (with ε′ = 1/80, β′ = 4β₂ ≤ ε/8 and δ′ = δ/2) to the call to EstimateParameters in step 2. Thus Lemma 10 implies that, with probability at least 1 − δ/2, for all i,

    ‖ū_i − u_i‖ ≤ σ_i/8,  |σ̄_i − σ_i| ≤ σ_i/10  and  |w̄_i − w_i| ≤ w_i/45.    (32)

Next we analyze steps 3, 4 and 5. For each i, the set of samples S′_i which are labeled i in step 3 is drawn independently from the distribution μ̄_i, defined as

    μ̄_i(x) = ( Σ_j w_j N_{u_j,σ_j²I}(x) ) / ( Σ_j w_j N_{u_j,σ_j²I}(Ā_i) )  if x ∈ Ā_i,  and 0 otherwise,    (33)

where Ā_i = ∩_{j≠i} Ā_ij and Ā_ij = { x : w̄_i N_{ū_i,σ̄_i²I}(x) ≥ w̄_j N_{ū_j,σ̄_j²I}(x) }.

If (32) holds, then by Lemma 9 we have, for each i ≠ j, N_{u_i,σ_i²I}(Ā_ij) ≥ 1 − β₁. This implies (the proof is similar to that of (31), (29) above) that, for each i, ‖ μ̄_i − N_{u_i,σ_i²I} ‖_1 ≤ 4β₂ ≤ ε/8 and w_i(1 − β₂) ≤ E[ |S′_i|/|S′| ] ≤ w_i(1 + β₂). Also |S′| = m₂ = (2000/ε²) ln(4(k+1)/δ). So if (32) holds, we can apply Lemma 10 to the call to EstimateParameters in step 5, with ε′ = ε/18, β′ = 4β₂ and δ′ = δ/2. This gives us that, with probability at least 1 − δ/2, for all i,

    ‖û_i − u_i‖ < (5/9) ε σ_i,  |σ̂_i − σ_i| < ε σ_i  and  |ŵ_i − w_i| < (4/9) ε w_i.    (34)

Finally, by the union bound (over the two calls to EstimateParameters), we have that (34), and hence (26), holds with probability at least 1 − δ.

4 Conclusion

We have described an algorithm, Algorithm 1, which, given random samples generated by an unknown mixture of Gaussians and the maximum-a-posteriori oracle (1), is able to recover its parameters assuming only a mild separation condition (25) between the component Gaussians holds. There are a number of ways of extending our algorithm. For instance, we can prove the correctness of our algorithm only for spherical Gaussians, and it is likely that one can modify our algorithm and the separation condition to work for arbitrary Gaussians as well. One can also ask how our algorithm could better utilize unlabeled samples; this could probably give a better error bound and a better label complexity. Finally, one can try to give algorithms for more realistic settings by allowing the unlabeled samples or the oracle to be noisy.

Acknowledgment

I am grateful to Daniel Štefankovič for helpful discussions and for suggesting improvements to an earlier version of this paper. I would also like to thank the anonymous reviewers for helpful comments. This material is based upon work supported by National Science Foundation grants CCF-09045, IIS-04800, ISS and University of Pennsylvania subcontract NSF #. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the above named organizations.

References

[Achlioptas 05] Dimitris Achlioptas & Frank McSherry. On Spectral Learning of Mixtures of Distributions. In COLT, 2005.
[Arora 01] Sanjeev Arora & Ravi Kannan. Learning mixtures of arbitrary Gaussians. In STOC, 2001.
[Basu 02] Sugato Basu, Arindam Banerjee & Raymond J. Mooney. Semi-supervised Clustering by Seeding. In ICML, 2002.
[Basu 04] Sugato Basu, Arindam Banerjee & Raymond J. Mooney. Active Semi-Supervision for Pairwise Constrained Clustering. In SDM, 2004.
[Belkin 10] Mikhail Belkin & Kaushik Sinha. Polynomial Learning of Distribution Families. In IEEE FOCS, 2010.
[Castelli 96] Vittorio Castelli & Thomas M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, vol. 42, no. 6, 1996.
[Dasgupta 99] Sanjoy Dasgupta. Learning Mixtures of Gaussians. In FOCS, 1999.
[Dasgupta 07] Sanjoy Dasgupta & Leonard J. Schulman. A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians. Journal of Machine Learning Research, vol. 8, pages 203-226, 2007.
[Dempster 77] A. P. Dempster, N. M. Laird & D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, vol. 39, pages 1-38, 1977.
[Devroye 01] Luc Devroye & Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics, Springer-Verlag, New York, 2001.
[Dubhashi 09] Devdatt Dubhashi & Alessandro Panconesi. Concentration of measure for the analysis of randomized algorithms. Cambridge University Press, 2009.
[Feldman 06] Jon Feldman, Rocco A. Servedio & Ryan O'Donnell. PAC Learning Axis-Aligned Mixtures of Gaussians with No Separation Assumption. In COLT, 2006.

[Kannan 08] Ravindran Kannan, Hadi Salmasian & Santosh Vempala. The Spectral Method for General Mixture Models. SIAM J. Comput., vol. 38, no. 3, 2008.
[Melnykov 10] Volodymyr Melnykov & Ranjan Maitra. Finite mixture models and model-based clustering. Statist. Surv., vol. 4, 2010.
[Moitra 10] Ankur Moitra & Gregory Valiant. Settling the Polynomial Learnability of Mixtures of Gaussians. In IEEE FOCS, 2010.
[Shental 03] Noam Shental, Aharon Bar-Hillel, Tomer Hertz & Daphna Weinshall. Computing Gaussian Mixture Models with EM Using Equivalence Constraints. In NIPS, 2003.
[Vempala 02] Santosh Vempala & Grant Wang. A Spectral Algorithm for Learning Mixtures of Distributions. In IEEE FOCS, 2002.
[Xing 02] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan & Stuart J. Russell. Distance Metric Learning with Application to Clustering with Side-Information. In NIPS, 2002.
