Sharp bounds for learning a mixture of two Gaussians
1 Sharp bounds for learning a mixture of two Gaussians. Moritz Hardt and Eric Price, IBM Almaden.
2 Problem. [Figure: height distribution of American 20-year-olds; x-axis: height (cm).]
3 Problem. [Figure: height distribution of American 20-year-olds; x-axis: height (cm).] Male/female heights are very close to Gaussian distributions. Can we learn the average male and female heights from unlabeled population data? How many samples do we need to learn µ_1, µ_2 to ±ɛσ?
4 Gaussian Mixtures: Origins
5 Gaussian Mixtures: Origins. Contributions to the Mathematical Theory of Evolution, Karl Pearson, 1894. Pearson's naturalist buddy measured lots of crab body parts. Most lengths seemed to follow the normal distribution (a recently coined name), but the forehead size wasn't symmetric. Maybe there were actually two species of crabs?
6 More previous work. Pearson 1894: proposed a method for 2 Gaussians, the method of moments. Other empirical papers over the years: Royce '58, Gridgeman '70, Gupta-Huang '80. Provable results assuming the components are well separated: clustering (Dasgupta '99, DA '00) and spectral methods (VW '04, AK '05, KSV '05, AM '05, VW '05). Kalai-Moitra-Valiant 2010: first general polynomial bound, extended to mixtures of general k: Moitra-Valiant '10, Belkin-Sinha '10. The KMV polynomial is very large. Our result: tight upper and lower bounds for the sample complexity, for k = 2 mixtures in arbitrary dimension d.
7-9 Learning the components vs. learning the sum. [Figure: height distribution; x-axis: height (cm).] It's important that we want to learn the individual components: the male/female average heights and standard deviations. Merely getting an ɛ-approximation in TV norm to the overall distribution takes Θ(1/ɛ^2) samples via black-box techniques. That is quite general: it holds for any mixture of known unimodal distributions [Chan, Diakonikolas, Servedio, Sun '13].
10 We show Pearson's 1894 method can be extended to be optimal! Suppose we want means and variances to ɛ accuracy: µ_i to ±ɛσ, and σ_i^2 to ±ɛ^2 σ^2. In one dimension: Θ(1/ɛ^12) samples are necessary and sufficient (previously: O(1/ɛ^300)); moreover, the algorithm is almost the same as Pearson's (1894). In d dimensions: Θ((1/ɛ^12) log d) samples are necessary and sufficient, where σ^2 is the max variance in any coordinate, and we get each entry of the covariance matrix to ±ɛ^2 σ^2 (previously: O((d/ɛ)^{300,000})). Caveat: we assume p_1, p_2 are bounded away from zero.
11 Outline. 1. Algorithm in One Dimension. 2. Algorithm in d Dimensions. 3. Lower Bound.
13 Method of Moments. [Figure: height distribution; x-axis: height (cm).] We want to learn five parameters: µ_1, µ_2, σ_1, σ_2, p_1, p_2 with p_1 + p_2 = 1. Moments give polynomial equations in the parameters:

M_1 := E[x] = p_1 µ_1 + p_2 µ_2
M_2 := E[x^2] = p_1 µ_1^2 + p_2 µ_2^2 + p_1 σ_1^2 + p_2 σ_2^2
M_3, M_4, M_5 = [...]

Use our samples to estimate the moments, then solve the system of equations to find the parameters.
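A minimal numerical sketch of the first two moment equations (the mixture parameters here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up mixture parameters (illustration only).
p1, mu1, s1 = 0.5, -1.0, 1.0
p2, mu2, s2 = 0.5, 1.0, 1.5

n = 200_000
comp = rng.random(n) < p1
x = np.where(comp, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

# Empirical raw moments vs. the first two moment equations above.
M = [np.mean(x**i) for i in (1, 2)]
M1_true = p1 * mu1 + p2 * mu2
M2_true = p1 * (mu1**2 + s1**2) + p2 * (mu2**2 + s2**2)
print(M[0], M1_true)
print(M[1], M2_true)
```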
14 Method of Moments: solving the system. Start with five parameters. First, we can assume mean zero by converting to central moments: M_2' = M_2 - M_1^2 is independent of translation. Analogously, we can assume min(σ_1, σ_2) = 0 by converting to excess moments: X_4 = M_4 - 3 M_2^2 is independent of adding N(0, σ^2). ("Excess kurtosis" was coined by Pearson, and appears in every Wikipedia probability distribution infobox.) This leaves three free parameters.
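A sketch of the two reductions on made-up data: the central moment M_2 - M_1^2 is translation-invariant, and the excess moment X_4 = M_4 - 3 M_2^2 (computed from central moments) vanishes for a pure Gaussian, so it is unchanged by adding independent N(0, s^2) noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000
# Samples from a made-up equal-weight mixture (illustration only).
x = np.concatenate([rng.normal(-1.0, 1.0, n // 2), rng.normal(1.0, 1.5, n // 2)])

M1 = x.mean()
C2 = np.mean(x**2) - M1**2   # central second moment: translation-invariant
C4 = np.mean((x - M1)**4)
X4 = C4 - 3 * C2**2          # excess moment: invariant under adding N(0, s^2)

# For pure Gaussian samples the excess moment is ~0, whatever the variance.
g = rng.normal(0.0, 2.0, n)
X4_gauss = np.mean((g - g.mean())**4) - 3 * np.var(g)**2
print(X4, X4_gauss)
```

For this mixture X4 is visibly nonzero (about -0.83), which is exactly the signal the method exploits.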
15 Method of Moments: system of equations. It is convenient to reparameterize by

α = -µ_1 µ_2, β = µ_1 + µ_2, γ = (σ_2^2 - σ_1^2) / (µ_2 - µ_1).

This gives

X_3 = α(β + 3γ)
X_4 = α(-2α + β^2 + 6βγ + 3γ^2)
X_5 = α(β^3 - 8αβ + 10β^2 γ + 15βγ^2 - 20αγ)
X_6 = α(16α^2 - 12αβ^2 - 60αβγ + β^4 + 15β^3 γ + 45β^2 γ^2 + 15βγ^3)

"All my attempts to obtain a simpler set have failed... It is possible, however, that some other... equations of a less complex kind may ultimately be found." (Karl Pearson)
16 Pearson's Polynomial. Chug chug chug... we get a 9th-degree polynomial in the excess moments X_3, X_4, X_5:

p(α) = 8α^9 + 28X_4 α^7 - 12X_3^2 α^6 + (24X_3 X_5 + 30X_4^2) α^5 + (6X_5^2 - 148X_3^2 X_4) α^4 + (96X_3^4 - 36X_3 X_4 X_5 + 9X_4^3) α^3 + (24X_3^3 X_5 + 21X_3^2 X_4^2) α^2 - 32X_3^4 X_4 α + 8X_3^6 = 0

It is easy to go from solutions α to mixtures (µ_i, σ_i, p_i).
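As a numerical sanity check, the sketch below builds α, β, γ from a made-up mean-zero mixture (with σ_1 = 0, as after the excess-moment reduction), computes X_3, X_4, X_5 from the reparameterized formulas, and verifies that α is a root of the nonic; the coefficient list is our reconstruction of the slide's polynomial:

```python
import numpy as np

# Made-up mean-zero mixture with sigma_1 = 0 (illustration only).
p1, p2 = 0.3, 0.7
mu1 = -1.4
mu2 = -p1 * mu1 / p2          # enforce p1*mu1 + p2*mu2 = 0
s2sq = 0.8                    # sigma_2^2

alpha = -mu1 * mu2
beta = mu1 + mu2
gamma = s2sq / (mu2 - mu1)

X3 = alpha * (beta + 3*gamma)
X4 = alpha * (-2*alpha + beta**2 + 6*beta*gamma + 3*gamma**2)
X5 = alpha * (beta**3 - 8*alpha*beta + 10*beta**2*gamma
              + 15*beta*gamma**2 - 20*alpha*gamma)

# Coefficients of the 9th-degree polynomial, highest degree first.
coeffs = [8, 0, 28*X4, -12*X3**2,
          24*X3*X5 + 30*X4**2,
          6*X5**2 - 148*X3**2*X4,
          96*X3**4 - 36*X3*X4*X5 + 9*X4**3,
          24*X3**3*X5 + 21*X3**2*X4**2,
          -32*X3**4*X4,
          8*X3**6]
residual = np.polyval(coeffs, alpha)
print(residual)  # ~ 0: alpha is a root
```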
17-18 Pearson's Polynomial. We get a 9th-degree polynomial in the excess moments X_3, X_4, X_5. Positive roots correspond to mixtures that match on five moments; usually there are two such roots. Pearson's proposal: choose the candidate with the closer 6th moment. This works because six moments uniquely identify the mixture [KMV]. How robust is this to moment-estimation error? Usually it works well, but not when there's a double root.
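A sketch of the sixth-moment tie-breaking step. The parameters and the alternative candidate below are made up; in the real method both candidates would come from positive roots of the nonic:

```python
import numpy as np

def sixth_moment(p1, mu1, s1, mu2, s2):
    # E[x^6] of a two-component Gaussian mixture; for N(mu, s^2),
    # E[x^6] = mu^6 + 15 mu^4 s^2 + 45 mu^2 s^4 + 15 s^6.
    def m6(mu, s):
        return mu**6 + 15*mu**4*s**2 + 45*mu**2*s**4 + 15*s**6
    return p1 * m6(mu1, s1) + (1 - p1) * m6(mu2, s2)

rng = np.random.default_rng(2)
# Made-up true mixture: (p1, mu1, s1, mu2, s2).
true = (0.5, -1.0, 1.0, 1.0, 1.5)
n = 500_000
comp = rng.random(n) < true[0]
x = np.where(comp, rng.normal(true[1], true[2], n),
             rng.normal(true[3], true[4], n))
M6_hat = np.mean(x**6)

# Pick the candidate whose predicted sixth moment is closer to the
# empirical one (the second candidate is an arbitrary decoy).
candidates = [true, (0.3, -1.5, 0.8, 0.6, 1.2)]
best = min(candidates, key=lambda c: abs(sixth_moment(*c) - M6_hat))
print(best == true)
```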
19 Making it robust in all cases. We can create another ninth-degree polynomial p_6 from X_3, X_4, X_5, X_6. Then α is the unique positive root of r(α) := p_5(α)^2 + p_6(α)^2, so q(x) := r(x)/(x - α)^2 has no positive roots. We would like q(x) ≥ c > 0 for all x and all mixtures (α, β, γ); then if p̂_5, p̂_6 are within ɛ of p_5, p_6, the minimizer satisfies |arg min r̂(x) - α| ≲ ɛ/√c. Compactness: this is true on any closed and bounded region. Bounded: for unbounded variables, the dominating terms show q → ∞. Closed: the issue is that x > 0 isn't closed; but we can use X_3, X_4 to get an O(1) approximation α̃ to α, and [α̃/10, 10α̃] is closed.
20 Result. [Plots: large vs. small separation.] Suppose the two components have means Δσ apart. Then if we know M_i to ±ɛ(Δσ)^i, the algorithm recovers the means to ±ɛΔσ; therefore O(1/(Δ^12 ɛ^2)) samples give an ɛΔσ approximation. If the components are Ω(1) standard deviations apart, O(1/ɛ^2) samples suffice. In general, O(1/ɛ^12) samples suffice to get ɛσ accuracy.
21 Outline. 1. Algorithm in One Dimension. 2. Algorithm in d Dimensions. 3. Lower Bound.
22 Algorithm in d dimensions. Idea: project to lower dimensions. Look at individual coordinates: get {µ_{1,i}, µ_{2,i}} to ±ɛσ. How do we piece them together? Suppose we could solve d = 2: then we can match up {µ_{1,i}, µ_{2,i}} with {µ_{1,j}, µ_{2,j}}. To solve d = 2: project x ↦ ⟨v, x⟩ for many random v. For µ ≠ µ′, we will have ⟨µ, v⟩ ≠ ⟨µ′, v⟩ with constant probability. So we solve the d-dimensional case with poly(d) calls to the 1-dimensional case. The only loss is log(1/δ) → log(d/δ): Θ((1/ɛ^12) log(d/δ)) samples.
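A toy sketch of the matching step, with an exact 1-D oracle standing in for the one-dimensional algorithm (the setup and names are ours, for illustration). Each coordinate yields an unordered pair of means; projecting onto e_0 + e_i (a d = 2 style projection) resolves how coordinate i pairs with coordinate 0:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
# Made-up component mean vectors (illustration only).
mu1 = rng.normal(size=d)
mu2 = rng.normal(size=d)

def pair_oracle(w):
    # Stand-in for the 1-D algorithm: returns the unordered pair
    # {<mu1, w>, <mu2, w>} exactly (the real algorithm estimates it).
    return sorted([float(mu1 @ w), float(mu2 @ w)])

e = np.eye(d)
pairs = [pair_oracle(e[i]) for i in range(d)]

# The consistent pairing must produce sums appearing in the pair
# obtained from the projection e_0 + e_i.
rec1, rec2 = [pairs[0][0]], [pairs[0][1]]
for i in range(1, d):
    a, b = pairs[i]
    s = pair_oracle(e[0] + e[i])
    if any(np.isclose(rec1[0] + a, t) for t in s) and \
       any(np.isclose(rec2[0] + b, t) for t in s):
        rec1.append(a); rec2.append(b)
    else:
        rec1.append(b); rec2.append(a)

ok = (np.allclose(rec1, mu1) and np.allclose(rec2, mu2)) or \
     (np.allclose(rec1, mu2) and np.allclose(rec2, mu1))
print(ok)
```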
23 Outline. 1. Algorithm in One Dimension. 2. Algorithm in d Dimensions. 3. Lower Bound.
24-31 Lower bound in one dimension. The algorithm takes O(1/ɛ^12) samples because it uses six moments; it is necessary to estimate the sixth moment to ±(ɛσ)^6. Let F, F′ be any two mixtures with five matching moments and constant means and variances, and add N(0, σ^2) to each mixture as σ grows. [Animation: the two densities become indistinguishable as σ grows.] Claim: Ω(σ^12) samples are necessary to distinguish the two distributions.
32 Lower bound in one dimension. The two mixtures F ≠ F′ have TV(F, F′) ≈ 1/σ^6. That shows Ω(σ^6) samples are necessary, and only O(σ^12) sufficient, for distinguishing them. We improve this using the squared Hellinger distance

H^2(P, Q) := (1/2) ∫ (√p(x) - √q(x))^2 dx.

H^2 is subadditive on product measures, so the sample complexity of distinguishing is Ω(1/H^2(F, F′)). In general H^2 ≤ TV ≤ √2 · H, but often H ≈ TV.
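A quick numerical check of these relations between H^2 and TV, on two made-up one-dimensional mixtures (simple grid integration):

```python
import numpy as np

x = np.linspace(-15, 15, 200_001)
dx = x[1] - x[0]

def gauss(x, mu, s):
    return np.exp(-(x - mu)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# Two made-up mixtures (illustration only).
p = 0.5 * gauss(x, -1.0, 1.0) + 0.5 * gauss(x, 1.0, 1.5)
q = 0.3 * gauss(x, -1.2, 1.1) + 0.7 * gauss(x, 0.5, 1.4)

H2 = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q))**2) * dx
TV = 0.5 * np.sum(np.abs(p - q)) * dx
print(H2, TV)  # H^2 <= TV <= sqrt(2) * H
```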
33-37 Bounding the Hellinger distance: general idea.

Definition: H^2(P, Q) = (1/2) ∫ (√p(x) - √q(x))^2 dx = 1 - ∫ √(p(x) q(x)) dx.

If q(x) = (1 + Δ(x)) p(x) for some small Δ, then [Pollard 00]:

H^2(p, q) = 1 - ∫ √(1 + Δ(x)) p(x) dx
          = 1 - E_{x∼p}[√(1 + Δ(x))]
          = 1 - E_{x∼p}[1 + Δ(x)/2 - O(Δ^2(x))]
          ≲ E_{x∼p}[Δ^2(x)],

since E_{x∼p}[Δ(x)] = ∫ (q(x) - p(x)) dx = 0.

Compare to TV(p, q) = (1/2) E_{x∼p}[|Δ(x)|].
38 Bounding the Hellinger distance: our setting.

Lemma: Let F, F′ be two subgaussian distributions with k matching moments and constant parameters. Then for G = F + N(0, σ^2) and G′ = F′ + N(0, σ^2), H^2(G, G′) ≲ 1/σ^{2k+2}.

One can show both G and G′ are within an O(1) factor of ν := N(0, σ^2) over [-σ^2, σ^2]. We have

Δ(x) ≈ (G′(x) - G(x)) / ν(x) = ∫ (ν(x - t)/ν(x)) (F′(t) - F(t)) dt.

Expanding ν(x - t)/ν(x) = Σ_{d≥0} c_d(x) t^d, the coefficients satisfy |c_d(x)| ≲ ((1 + |x|/σ)/σ)^d. Since the first k moments match, ∫ t^d (F′(t) - F(t)) dt = 0 for all d ≤ k, so only terms with d ≥ k + 1 survive:

|Δ(x)| ≲ Σ_{d≥k+1} ((1 + |x|/σ)/σ)^d ≲ ((1 + |x|/σ)/σ)^{k+1},

and therefore H^2(G, G′) ≲ E_{x∼G}[Δ^2(x)] ≲ 1/σ^{2k+2}.
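The lemma's 1/σ^{2k+2} scaling can be checked numerically. The pair below is our own illustration, not the one from the talk: F uniform on {-1, +1} and F′ = N(0, 1) match their first three moments (k = 3), so the prediction is H^2 ≈ 1/σ^8:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def hellinger_sq(sigma):
    # H^2 between G = F + N(0, sigma^2) and G' = F' + N(0, sigma^2),
    # where F is uniform on {-1, +1} and F' = N(0, 1); grid integration.
    x = np.linspace(-20 * sigma, 20 * sigma, 400_001)
    dx = x[1] - x[0]
    g = 0.5 * gauss(x, -1, sigma**2) + 0.5 * gauss(x, 1, sigma**2)
    gp = gauss(x, 0, 1 + sigma**2)
    return 0.5 * np.sum((np.sqrt(g) - np.sqrt(gp))**2) * dx

ratio = hellinger_sq(5.0) / hellinger_sq(10.0)
# Doubling sigma should shrink H^2 by about 2^8; with the 1 + sigma^2
# correction the predicted ratio is ((1 + 100)/(1 + 25))^4, roughly 228.
print(ratio)
```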
39-48 Lower bound in one dimension. Add N(0, σ^2) to two mixtures with five matching moments. [Animation: the two densities become indistinguishable as σ grows.] For

G = (1/2) N(-1, 1 + σ^2) + (1/2) N(1, 2 + σ^2)
G′ ≈ 0.297 N(-1.226, c_1 + σ^2) + 0.703 N(0.517, c_2 + σ^2)

(with constants c_1, c_2 chosen so that five moments match), we have H^2(G, G′) ≲ 1/σ^12. Therefore distinguishing G from G′ takes Ω(σ^12) samples, so we cannot learn either the means to ±ɛσ or the variances to ±ɛ^2 σ^2 with o(1/ɛ^12) samples.
49 Recap and open questions. Our result: Θ((1/ɛ^12) log d) samples are necessary and sufficient to estimate µ_i to ±ɛσ and σ_i^2 to ±ɛ^2 σ^2. If the means have Δσ separation, just O(1/(ɛ^2 Δ^12)) samples for ±ɛΔσ accuracy. Extend to k > 2? The lower bound extends, giving Ω(1/ɛ^{6k}); but do we really care about finding an O(1/ɛ^18) algorithm? Solving the system of equations gets nasty. Is there an automated way of figuring out whether the solution to a system of polynomial equations is robust?
More informationMath 141: Lecture 11
Math 141: Lecture 11 The Fundamental Theorem of Calculus and integration methods Bob Hough October 12, 2016 Bob Hough Math 141: Lecture 11 October 12, 2016 1 / 36 First Fundamental Theorem of Calculus
More informationStatistical Query Lower Bounds for Robust Estimation of High-Dimensional Gaussians and Gaussian Mixtures (Extended Abstract)
58th Annual IEEE Symposium on Foundations of Computer Science Statistical Query Lower Bounds for Robust Estimation of High-Dimensional Gaussians and Gaussian Mixtures (Extended Abstract) Ilias Diakonikolas
More informationBALANCING GAUSSIAN VECTORS. 1. Introduction
BALANCING GAUSSIAN VECTORS KEVIN P. COSTELLO Abstract. Let x 1,... x n be independent normally distributed vectors on R d. We determine the distribution function of the minimum norm of the 2 n vectors
More informationRegression I: Mean Squared Error and Measuring Quality of Fit
Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving
More informationCS8803: Statistical Techniques in Robotics Byron Boots. Hilbert Space Embeddings
CS8803: Statistical Techniques in Robotics Byron Boots Hilbert Space Embeddings 1 Motivation CS8803: STR Hilbert Space Embeddings 2 Overview Multinomial Distributions Marginal, Joint, Conditional Sum,
More informationlearning bounds for importance weighting Tamas Madarasz & Michael Rabadi April 15, 2015
learning bounds for importance weighting Tamas Madarasz & Michael Rabadi April 15, 2015 Introduction Often, training distribution does not match testing distribution Want to utilize information about test
More informationSupplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data
Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department
More information18.175: Lecture 15 Characteristic functions and central limit theorem
18.175: Lecture 15 Characteristic functions and central limit theorem Scott Sheffield MIT Outline Characteristic functions Outline Characteristic functions Characteristic functions Let X be a random variable.
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More information4 Integration. Copyright Cengage Learning. All rights reserved.
4 Integration Copyright Cengage Learning. All rights reserved. 4.1 Antiderivatives and Indefinite Integration Copyright Cengage Learning. All rights reserved. Objectives! Write the general solution of
More informationHow might we evaluate this? Suppose that, by some good luck, we knew that. x 2 5. x 2 dx 5
8.4 1 8.4 Partial Fractions Consider the following integral. 13 2x (1) x 2 x 2 dx How might we evaluate this? Suppose that, by some good luck, we knew that 13 2x (2) x 2 x 2 = 3 x 2 5 x + 1 We could then
More informationLocal Minimax Testing
Local Minimax Testing Sivaraman Balakrishnan and Larry Wasserman Carnegie Mellon June 11, 2017 1 / 32 Hypothesis testing beyond classical regimes Example 1: Testing distribution of bacteria in gut microbiome
More informationMath123 Lecture 1. Dr. Robert C. Busby. Lecturer: Office: Korman 266 Phone :
Lecturer: Math1 Lecture 1 Dr. Robert C. Busby Office: Korman 66 Phone : 15-895-1957 Email: rbusby@mcs.drexel.edu Course Web Site: http://www.mcs.drexel.edu/classes/calculus/math1_spring0/ (Links are case
More informationChapter 4: Asymptotic Properties of the MLE
Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider
More informationMixture of Gaussians Models
Mixture of Gaussians Models Outline Inference, Learning, and Maximum Likelihood Why Mixtures? Why Gaussians? Building up to the Mixture of Gaussians Single Gaussians Fully-Observed Mixtures Hidden Mixtures
More information1 Dimension Reduction in Euclidean Space
CSIS0351/8601: Randomized Algorithms Lecture 6: Johnson-Lindenstrauss Lemma: Dimension Reduction Lecturer: Hubert Chan Date: 10 Oct 011 hese lecture notes are supplementary materials for the lectures.
More informationReview all the activities leading to Midterm 3. Review all the problems in the previous online homework sets (8+9+10).
MA109, Activity 34: Review (Sections 3.6+3.7+4.1+4.2+4.3) Date: Objective: Additional Assignments: To prepare for Midterm 3, make sure that you can solve the types of problems listed in Activities 33 and
More informationLatent Variable Models
Latent Variable Models Stefano Ermon, Aditya Grover Stanford University Lecture 5 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 5 1 / 31 Recap of last lecture 1 Autoregressive models:
More informationDecoding Reed-Muller codes over product sets
Rutgers University May 30, 2016 Overview Error-correcting codes 1 Error-correcting codes Motivation 2 Reed-Solomon codes Reed-Muller codes 3 Error-correcting codes Motivation Goal: Send a message Don t
More informationHidden Markov Models Part 2: Algorithms
Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:
More informationMulti-View Clustering via Canonical Correlation Analysis
Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in
More informationStatistical and Computational Analysis of Locality Preserving Projection
Statistical and Computational Analysis of Locality Preserving Projection Xiaofei He xiaofei@cs.uchicago.edu Department of Computer Science, University of Chicago, 00 East 58th Street, Chicago, IL 60637
More informationIterative solvers for linear equations
Spectral Graph Theory Lecture 23 Iterative solvers for linear equations Daniel A. Spielman November 26, 2018 23.1 Overview In this and the next lecture, I will discuss iterative algorithms for solving
More informationConfidence Intervals
Quantitative Foundations Project 3 Instructor: Linwei Wang Confidence Intervals Contents 1 Introduction 3 1.1 Warning....................................... 3 1.2 Goals of Statistics..................................
More informationDistance between multinomial and multivariate normal models
Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating
More informationLecture 9: March 26, 2014
COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 9: March 26, 204 Spring 204 Scriber: Keith Nichols Overview. Last Time Finished analysis of O ( n ɛ ) -query
More informationSparse Solutions to Nonnegative Linear Systems and Applications
Aditya Bhaskara Ananda Theertha Suresh Morteza Zadimoghaddam Google Research NYC University of California, San Diego Google Research NYC Abstract We give an efficient algorithm for finding sparse approximate
More informationAnswers and expectations
Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E
More informationChebyshev Polynomials and Approximation Theory in Theoretical Computer Science and Algorithm Design
Chebyshev Polynomials and Approximation Theory in Theoretical Computer Science and Algorithm Design (Talk for MIT s Danny Lewin Theory Student Retreat, 2015) Cameron Musco October 8, 2015 Abstract I will
More informationLearning symmetric non-monotone submodular functions
Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru
More informationWeak convergence. Amsterdam, 13 November Leiden University. Limit theorems. Shota Gugushvili. Generalities. Criteria
Weak Leiden University Amsterdam, 13 November 2013 Outline 1 2 3 4 5 6 7 Definition Definition Let µ, µ 1, µ 2,... be probability measures on (R, B). It is said that µ n converges weakly to µ, and we then
More information18.440: Lecture 26 Conditional expectation
18.440: Lecture 26 Conditional expectation Scott Sheffield MIT 1 Outline Conditional probability distributions Conditional expectation Interpretation and examples 2 Outline Conditional probability distributions
More information7.4: Integration of rational functions
A rational function is a function of the form: f (x) = P(x) Q(x), where P(x) and Q(x) are polynomials in x. P(x) = a n x n + a n 1 x n 1 + + a 0. Q(x) = b m x m + b m 1 x m 1 + + b 0. How to express a
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and
More informationTensor Methods for Feature Learning
Tensor Methods for Feature Learning Anima Anandkumar U.C. Irvine Feature Learning For Efficient Classification Find good transformations of input for improved classification Figures used attributed to
More informationAccelerating the EM Algorithm for Mixture Density Estimation
Accelerating the EM Algorithm ICERM Workshop September 4, 2015 Slide 1/18 Accelerating the EM Algorithm for Mixture Density Estimation Homer Walker Mathematical Sciences Department Worcester Polytechnic
More informationOptimal estimation of Gaussian mixtures via denoised method of moments
Optimal estimation of Gaussian mixtures via denoised method of moments Yihong Wu and Pengkun Yang Working paper: July 15, 2018 Abstract The Method of Moments is one of the most widely used methods in statistics
More informationUnsupervised Learning: Dimensionality Reduction
Unsupervised Learning: Dimensionality Reduction CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 3 Outline In this lecture, we set about to solve the problem posed in the previous lecture Given a dataset,
More informationEfficient Density Estimation via Piecewise Polynomial Approximation
Efficient Density Estimation via Piecewise Polynomial Approximation [Extended Abstract] ABSTRACT Siu-On Chan Microsoft Research Cambridge, MA siuon@cs.berkeley.edu Rocco A. Servedio Columbia University
More informationPolynomial Harmonic Decompositions
DOI: 10.1515/auom-2017-0002 An. Şt. Univ. Ovidius Constanţa Vol. 25(1),2017, 25 32 Polynomial Harmonic Decompositions Nicolae Anghel Abstract For real polynomials in two indeterminates a classical polynomial
More information