Efficient algorithms for estimating multi-view mixture models


1 Efficient algorithms for estimating multi-view mixture models Daniel Hsu Microsoft Research, New England

2 Outline Multi-view mixture models Multi-view method-of-moments Some applications and open questions Concluding remarks

3 Part 1. Multi-view mixture models Multi-view mixture models Unsupervised learning and mixture models Multi-view mixture models Complexity barriers Multi-view method-of-moments Some applications and open questions Concluding remarks

4 Unsupervised learning Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled.

5 Unsupervised learning Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled. Unsupervised learning: extract useful info from this data. Disentangle sub-populations in data source. Discover useful representations for downstream stages of learning pipeline (e.g., supervised learning).

6 Mixture models Simple latent variable model: mixture model (h → x). h ∈ [k] := {1, 2, ..., k} (hidden); x ∈ R^d (observed); Pr[h = j] = w_j; x | h ∼ P_h; so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + ... + w_k P_k(x).

7 Mixture models Simple latent variable model: mixture model (h → x). h ∈ [k] := {1, 2, ..., k} (hidden); x ∈ R^d (observed); Pr[h = j] = w_j; x | h ∼ P_h; so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + ... + w_k P_k(x). Typical use: learn about constituent sub-populations (e.g., clusters) in the data source.
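
To make the generative process concrete, here is a minimal sketch (not part of the original slides) of sampling from such a mixture; the weights, means, and spherical-Gaussian components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture with k = 3 components in R^2 (weights/means made up).
w = np.array([0.5, 0.3, 0.2])                         # mixing weights Pr[h = j]
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # component means

def sample_mixture(n):
    """Draw n samples from P(x) = sum_j w_j P_j(x), with P_j = N(mu_j, I)."""
    h = rng.choice(len(w), size=n, p=w)           # hidden labels (never observed)
    x = mus[h] + rng.standard_normal((n, 2))      # observed points
    return h, x

h, x = sample_mixture(1000)
```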

8 Multi-view mixture models Can we take advantage of diverse sources of information?

9 Multi-view mixture models Can we take advantage of diverse sources of information? h ∈ [k]; graphical model h → x_1, x_2, ..., x_ℓ; x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, ..., x_ℓ ∈ R^{d_ℓ}. k = # components, ℓ = # views (e.g., audio, video, text). View 1: x_1 ∈ R^{d_1}. View 2: x_2 ∈ R^{d_2}. View 3: x_3 ∈ R^{d_3}.

10 Multi-view mixture models Can we take advantage of diverse sources of information? h ∈ [k]; graphical model h → x_1, x_2, ..., x_ℓ; x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, ..., x_ℓ ∈ R^{d_ℓ}. k = # components, ℓ = # views (e.g., audio, video, text). View 1: x_1 ∈ R^{d_1}. View 2: x_2 ∈ R^{d_2}. View 3: x_3 ∈ R^{d_3}.

11 Multi-view mixture models Multi-view assumption: views are conditionally independent given the component. View 1: x_1 ∈ R^{d_1}. View 2: x_2 ∈ R^{d_2}. View 3: x_3 ∈ R^{d_3}. Larger k (# components): more sub-populations to disentangle. Larger ℓ (# views): more non-redundant sources of information.

12 Semi-parametric estimation task Parameters of component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means µ_{v,j} := E[x_v | h = j] ∈ R^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate mixing weights and conditional means from independent copies of (x_1, x_2, ..., x_ℓ).

13 Semi-parametric estimation task Parameters of component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means µ_{v,j} := E[x_v | h = j] ∈ R^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate mixing weights and conditional means from independent copies of (x_1, x_2, ..., x_ℓ). Questions: 1. How do we estimate {w_j} and {µ_{v,j}} without observing h? 2. How many views ℓ are sufficient to learn with poly(k) computational / sample complexity?
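
For intuition about what is being estimated, here is a small sketch (mine, not from the slides) of the plug-in targets one could compute if the hidden label h were observed; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def oracle_parameters(h, views, k):
    """What the targets would be if the hidden label h were observed (it is not):
    plug-in mixing weights and per-view conditional means. `views` is a list of
    n-by-d_v arrays, `h` an integer array of length n with values in {0, ..., k-1}."""
    w = np.array([(h == j).mean() for j in range(k)])              # w_j = Pr[h = j]
    mus = [np.array([x[h == j].mean(axis=0) for j in range(k)])    # k x d_v per view
           for x in views]
    return w, mus
```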

14 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

15 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06).

16 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06). Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if components are well-separated (Moitra-Valiant, 10).

17 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06). Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if components are well-separated (Moitra-Valiant, 10). In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.

18 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, 99): sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.

19 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, 99): sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}. sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, 99; Dasgupta-Schulman, 00; Arora-Kannan, 00). sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, 02; Kannan-Salmasian-Vempala, 05; Achlioptas-McSherry, 05); also works for mixtures of log-concave distributions.

20 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, 99): sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}. sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, 99; Dasgupta-Schulman, 00; Arora-Kannan, 00). sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, 02; Kannan-Salmasian-Vempala, 05; Achlioptas-McSherry, 05); also works for mixtures of log-concave distributions. No minimum separation requirement: method-of-moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, 10; Belkin-Sinha, 10; Moitra-Valiant, 10).
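
For reference, a tiny sketch (not from the slides) of computing this separation statistic for given component means and per-component spreads:

```python
import numpy as np

def separation(mus, sigmas):
    """sep = min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j),
    for means mus (k x d array) and per-component spreads sigmas (length k)."""
    k = len(mus)
    return min(np.linalg.norm(mus[i] - mus[j]) / max(sigmas[i], sigmas[j])
               for i in range(k) for j in range(k) if i != j)
```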

21 Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: output distributions Pr[x_t = · | h_t = 1], Pr[x_t = · | h_t = 2], Pr[x_t = · | h_t = 3].]

22 Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: output distributions Pr[x_t = · | h_t = 1], Pr[x_t = · | h_t = 2], Pr[x_t = · | h_t = 3].] These instances are avoided by assuming the parameter matrices are full-rank (Mossel-Roch, 06; Hsu-Kakade-Zhang, 09).

23 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

24 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)!

25 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)! New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).

26 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)! New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs). General tensor decomposition framework applicable to a wide variety of estimation problems.

27 Part 2. Multi-view method-of-moments Multi-view mixture models Multi-view method-of-moments Overview Structure of moments Uniqueness of decomposition Computing the decomposition Asymmetric views Some applications and open questions Concluding remarks

28 The plan First, assume views are (conditionally) exchangeable, and derive the basic algorithm.

29 The plan First, assume views are (conditionally) exchangeable, and derive the basic algorithm. Then, provide a reduction from the general multi-view setting to the exchangeable case.

30 Simpler case: exchangeable views (Conditionally) exchangeable views: assume the views have the same conditional means, i.e., E[x_v | h = j] ≡ µ_j, j ∈ [k], v ∈ [ℓ].

31 Simpler case: exchangeable views (Conditionally) exchangeable views: assume the views have the same conditional means, i.e., E[x_v | h = j] ≡ µ_j, j ∈ [k], v ∈ [ℓ]. Motivating setting: bag-of-words model; x_1, x_2, ..., x_ℓ are ℓ exchangeable words in a document. One-hot encoding: x_v = e_i ⇔ the v-th word in the document is the i-th word in the vocabulary (where e_i ∈ {0, 1}^d has a 1 in the i-th position and 0 elsewhere). (µ_j)_i = E[(x_v)_i | h = j] = Pr[x_v = e_i | h = j], i ∈ [d], j ∈ [k].
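
As an illustration (not from the slides), a short sketch of the one-hot encoding and of the fact that averaging one-hot word vectors from documents of a single topic estimates that topic's word distribution µ_j; the vocabulary size and documents are made up.

```python
import numpy as np

d = 6  # hypothetical vocabulary size

def one_hot(i, d=d):
    """e_i in {0,1}^d: 1 in position i, 0 elsewhere."""
    e = np.zeros(d)
    e[i] = 1.0
    return e

# Hypothetical documents all drawn from the same topic j: since E[x_v | h = j]
# is the topic's word distribution mu_j, the empirical mean of the one-hot
# word vectors is a plug-in estimate of mu_j.
docs = [[0, 2, 2, 5], [2, 0, 1, 2], [5, 2, 0, 0]]
words = np.array([one_hot(i) for doc in docs for i in doc])
mu_j_hat = words.mean(axis=0)
```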

32 Key ideas 1. Method-of-moments: conditional means are revealed by appropriate low-rank decompositions of moment matrices and tensors. 2. Third-order tensor decomposition is uniquely determined by directions of (locally) maximum skew. 3. The required local optimization can be efficiently performed in poly time.

33 Algebraic structure in moments Recall: E[x_v | h = j] = µ_j.

34 Algebraic structure in moments Recall: E[x_v | h = j] = µ_j. By conditional independence and exchangeability of x_1, x_2, ..., x_ℓ given h, Pairs := E[x_1 ⊗ x_2] = E[E[x_1 | h] ⊗ E[x_2 | h]] = E[µ_h ⊗ µ_h] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ∈ R^{d×d}.

35 Algebraic structure in moments Recall: E[x_v | h = j] = µ_j. By conditional independence and exchangeability of x_1, x_2, ..., x_ℓ given h, Pairs := E[x_1 ⊗ x_2] = E[E[x_1 | h] ⊗ E[x_2 | h]] = E[µ_h ⊗ µ_h] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ∈ R^{d×d}. Triples := E[x_1 ⊗ x_2 ⊗ x_3] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i ∈ R^{d×d×d}, etc. (If only we could extract these low-rank decompositions...)
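
A minimal sketch (not from the slides) of the plug-in estimates of Pairs and Triples from n i.i.d. samples of the three views, stored as rows of three n-by-d arrays:

```python
import numpy as np

def empirical_moments(x1, x2, x3):
    """Plug-in estimates of Pairs = E[x1 (x) x2] and Triples = E[x1 (x) x2 (x) x3]
    from n i.i.d. samples of the three views (rows of x1, x2, x3)."""
    n = x1.shape[0]
    pairs = x1.T @ x2 / n                                   # d x d matrix
    triples = np.einsum('ni,nj,nk->ijk', x1, x2, x3) / n    # d x d x d tensor
    return pairs, triples
```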

36 2nd moment: subspace spanned by conditional means

37 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent)

38 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k.

39 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k ⇒ Pairs equips the k-dimensional subspace span{µ_1, µ_2, ..., µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y.

40 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k ⇒ Pairs equips the k-dimensional subspace span{µ_1, µ_2, ..., µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y. However, {µ_i} are not generally determined by Pairs alone (e.g., {µ_i} are not necessarily orthogonal).

41 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k ⇒ Pairs equips the k-dimensional subspace span{µ_1, µ_2, ..., µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y. However, {µ_i} are not generally determined by Pairs alone (e.g., {µ_i} are not necessarily orthogonal). Must look at higher-order moments?
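
A small numerical illustration (mine, not from the slides) of this point: two different sets of conditional means, with the same mixing weights, can produce exactly the same Pairs matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3
w = np.array([0.2, 0.3, 0.5])
M = rng.standard_normal((d, k))                    # columns are mu_1, ..., mu_k

# Rotate in "whitened" coordinates: M2 = M diag(sqrt(w)) Q diag(1/sqrt(w)).
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # a random orthogonal matrix
M2 = M @ np.diag(np.sqrt(w)) @ Q @ np.diag(1.0 / np.sqrt(w))

pairs1 = M @ np.diag(w) @ M.T
pairs2 = M2 @ np.diag(w) @ M2.T
print(np.allclose(pairs1, pairs2))                 # True: same Pairs, different means
```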

42 3rd moment: (cross) skew maximizers Claim: up to the third moment (i.e., 3 views) suffices. View Triples: R^d × R^d × R^d → R as a trilinear form.

43 3rd moment: (cross) skew maximizers Claim: up to the third moment (i.e., 3 views) suffices. View Triples: R^d × R^d × R^d → R as a trilinear form. Theorem. Each isolated local maximizer η of max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1 satisfies, for some i ∈ [k], Pairs η = √w_i µ_i and Triples(η, η, η) = 1/√w_i.

44 3rd moment: (cross) skew maximizers Claim: up to the third moment (i.e., 3 views) suffices. View Triples: R^d × R^d × R^d → R as a trilinear form. Theorem. Each isolated local maximizer η of max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1 satisfies, for some i ∈ [k], Pairs η = √w_i µ_i and Triples(η, η, η) = 1/√w_i. Also: these maximizers can be found efficiently and robustly.

45 Variational analysis max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1

46 Variational analysis max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1 (Substitute Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i and Triples = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i.)

47 Variational analysis max_{η ∈ R^d} ∑_{i=1}^k w_i (η^⊤µ_i)^3 s.t. ∑_{i=1}^k w_i (η^⊤µ_i)^2 ≤ 1 (Substitute Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i and Triples = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i.)

48 Variational analysis max_{η ∈ R^d} ∑_{i=1}^k w_i (η^⊤µ_i)^3 s.t. ∑_{i=1}^k w_i (η^⊤µ_i)^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].)

49 Variational analysis max_{η ∈ R^d} ∑_{i=1}^k (1/√w_i)(√w_i η^⊤µ_i)^3 s.t. ∑_{i=1}^k (√w_i η^⊤µ_i)^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].)

50 Variational analysis max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i^3 s.t. ∑_{i=1}^k θ_i^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].)

51 Variational analysis max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i^3 s.t. ∑_{i=1}^k θ_i^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].) Isolated local maximizers θ (found via gradient ascent) are e_1 = (1, 0, 0, ...), e_2 = (0, 1, 0, ...), etc., which means that each corresponding η satisfies, for some i ∈ [k], √w_j (η^⊤µ_j) = 1 if j = i, and 0 if j ≠ i.

52 Variational analysis max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i^3 s.t. ∑_{i=1}^k θ_i^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].) Isolated local maximizers θ (found via gradient ascent) are e_1 = (1, 0, 0, ...), e_2 = (0, 1, 0, ...), etc., which means that each corresponding η satisfies, for some i ∈ [k], √w_j (η^⊤µ_j) = 1 if j = i, and 0 if j ≠ i. Therefore Pairs η = ∑_{j=1}^k w_j µ_j (µ_j^⊤ η) = √w_i µ_i.

53 Extracting all isolated local maximizers 1. Start with T := Triples.

54 Extracting all isolated local maximizers 1. Start with T := Triples. 2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ and the maximizer is η.

55 Extracting all isolated local maximizers 1. Start with T := Triples. 2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ and the maximizer is η. 3. Deflation: replace T with T − λ η ⊗ η ⊗ η. Go to step 2.

56 Extracting all isolated local maximizers 1. Start with T := Triples. 2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ and the maximizer is η. 3. Deflation: replace T with T − λ η ⊗ η ⊗ η. Go to step 2. A variant of this runs in polynomial time (w.h.p.), and is robust to perturbations to Pairs and Triples.
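
For concreteness, here is a simplified sketch of this extraction step. It is not the exact procedure from the slides: instead of projected gradient ascent under the Pairs constraint, it whitens Triples with Pairs and runs tensor power iteration with deflation, which recovers the same maximizers when the moments are exact; the restart and iteration counts are arbitrary choices.

```python
import numpy as np

def learn_mixture(pairs, triples, k, n_restarts=10, n_iters=100, rng=None):
    """Simplified sketch of the moment-based extraction: whiten Triples with
    Pairs, then find skew-maximizing directions by tensor power iteration with
    deflation. Assumes exact (noiseless) moments; a stand-in for the
    constrained gradient ascent described on the slides."""
    rng = np.random.default_rng() if rng is None else rng
    # Whitening matrix W with W.T @ pairs @ W = I_k (top-k eigenspace of Pairs).
    evals, evecs = np.linalg.eigh(pairs)
    top = np.argsort(evals)[-k:]
    W = evecs[:, top] / np.sqrt(evals[top])                   # d x k
    T = np.einsum('ijk,ia,jb,kc->abc', triples, W, W, W)      # whitened k x k x k

    weights, means = [], []
    for _ in range(k):
        best_val, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.standard_normal(k)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):                    # tensor power iteration
                v = np.einsum('ijk,j,k->i', T, v, v)
                v /= np.linalg.norm(v)
            val = np.einsum('ijk,i,j,k->', T, v, v, v)
            if val > best_val:
                best_val, best_v = val, v
        lam, v = best_val, best_v                       # lam = 1 / sqrt(w_i)
        weights.append(1.0 / lam ** 2)
        means.append(lam * np.linalg.pinv(W.T) @ v)     # un-whiten to recover mu_i
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflation
    return np.array(weights), np.array(means)

# Example on exact moments built from known w (length k) and mus (k x d):
#   pairs = sum(w[i] * np.outer(mus[i], mus[i]) for i in range(k))
#   triples = sum(w[i] * np.einsum('a,b,c->abc', mus[i], mus[i], mus[i]) for i in range(k))
#   w_hat, mus_hat = learn_mixture(pairs, triples, k)
```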

57 General case: asymmetric views Each view v has a different set of conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} ⊂ R^{d_v}.

58 General case: asymmetric views Each view v has a different set of conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} ⊂ R^{d_v}. Reduction: transform x_1 and x_2 to look like x_3 via linear transformations.

59 Asymmetric cross moments Define the asymmetric cross moment Pairs_{u,v} := E[x_u ⊗ x_v].

60 Asymmetric cross moments Define the asymmetric cross moment Pairs_{u,v} := E[x_u ⊗ x_v]. Transforming view v to view 3: C_{v→3} := E[x_3 ⊗ x_u] E[x_v ⊗ x_u]^† ∈ R^{d_3 × d_v}, where † denotes the Moore-Penrose pseudoinverse.

61 Asymmetric cross moments Define the asymmetric cross moment Pairs_{u,v} := E[x_u ⊗ x_v]. Transforming view v to view 3: C_{v→3} := E[x_3 ⊗ x_u] E[x_v ⊗ x_u]^† ∈ R^{d_3 × d_v}, where † denotes the Moore-Penrose pseudoinverse. Simple exercise to show E[C_{v→3} x_v | h = j] = µ_{3,j}, so C_{v→3} x_v behaves like x_3 (as far as our algorithm can tell).
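
A small sketch (mine) of this reduction using plug-in cross moments; x3, xu, xv are sample matrices for view 3, the auxiliary view u, and the view v being transformed.

```python
import numpy as np

def transform_view(x3, xu, xv):
    """Plug-in version of C_{v->3} = E[x3 (x) xu] E[xv (x) xu]^+ from samples
    (rows of x3, xu, xv); returns the transformed copies C_{v->3} x_v, which
    have the same conditional means as view 3."""
    n = x3.shape[0]
    pairs_3u = x3.T @ xu / n
    pairs_vu = xv.T @ xu / n
    C = pairs_3u @ np.linalg.pinv(pairs_vu)     # d_3 x d_v
    return xv @ C.T                             # n x d_3
```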

62 Part 3. Some applications and open questions Multi-view mixture models Multi-view method-of-moments Some applications and open questions Mixtures of Gaussians Hidden Markov models and other models Topic models Open questions Concluding remarks

63 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, ..., µ_k ∈ R^n; no minimum separation requirement. [Figure: graphical model h → x_1, x_2, ..., x_n.]

64 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, ..., µ_k ∈ R^n; no minimum separation requirement. [Figure: graphical model h → x_1, x_2, ..., x_n.] Assumptions: non-degeneracy: component means span a k-dimensional subspace; weak incoherence condition: component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, 08).

65 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, ..., µ_k ∈ R^n; no minimum separation requirement. [Figure: graphical model h → x_1, x_2, ..., x_n.] Assumptions: non-degeneracy: component means span a k-dimensional subspace; weak incoherence condition: component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, 08). Then randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.
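
A minimal sketch (not from the slides) of this coordinate-splitting step for a sample matrix X with one coordinate per column:

```python
import numpy as np

def random_views(X, n_views=3, rng=None):
    """Randomly partition the n coordinates of axis-aligned Gaussian data X
    (samples in rows) into n_views disjoint blocks, giving conditionally
    independent views x_1, ..., x_l."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(X.shape[1])
    blocks = np.array_split(perm, n_views)
    return [X[:, b] for b in blocks]
```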

66 Hidden Markov models and others [Figure: HMM graphical model with hidden chain h_1 → h_2 → h_3 and observations x_1, x_2, x_3.]

67 Hidden Markov models and others [Figure: HMM graphical model h_1 → h_2 → h_3 with observations x_1, x_2, x_3, alongside the multi-view mixture model h → x_1, x_2, x_3.]

68 Hidden Markov models and others [Figure: HMM graphical model h_1 → h_2 → h_3 with observations x_1, x_2, x_3, alongside the multi-view mixture model h → x_1, x_2, x_3.] Other models: 1. Mixtures of Gaussians (Hsu-Kakade, ITCS 13) 2. HMMs (Anandkumar-Hsu-Kakade, COLT 12) 3. Latent Dirichlet Allocation (Anandkumar-Foster-Hsu-Kakade-Liu, NIPS 12) 4. Latent parse trees (Hsu-Kakade-Liang, NIPS 12) 5. Independent Component Analysis (Arora-Ge-Moitra-Sachdeva, NIPS 12; Hsu-Kakade, ITCS 13)

69 Bag-of-words clustering model (µ_j)_i = Pr[see word i in document | document topic is j]. Corpus: New York Times articles (from UCI). Vocabulary size d. Chose k = 50. For each topic j, show the top 10 words i.

70 Bag-of-words clustering model (µ_j)_i = Pr[see word i in document | document topic is j]. Corpus: New York Times articles (from UCI). Vocabulary size d. Chose k = 50. For each topic j, the top 10 words i (one topic per line):
sales, economic, consumer, major, home, indicator, weekly, order, claim, scheduled
run, inning, hit, game, season, home, right, games, dodger, left
school, student, teacher, program, official, public, children, high, education, district
drug, patient, million, company, doctor, companies, percent, cost, program, health
player, tiger_wood, won, shot, play, round, win, tournament, tour, right

71 Bag-of-words clustering model (one topic per line):
palestinian, israel, israeli, yasser_arafat, peace, israeli, israelis, leader, official, attack
tax, cut, percent, bush, billion, plan, bill, taxes, million, congress
cup, minutes, oil, water, add, tablespoon, food, teaspoon, pepper, sugar
point, game, team, shot, play, laker, season, half, lead, games
yard, game, play, season, team, touchdown, quarterback, coach, defense, quarter

72 Bag-of-words clustering model (one topic per line):
percent, stock, market, fund, investor, companies, analyst, money, investment, economy
al_gore, campaign, president, george_bush, bush, clinton, vice, presidential, million, democratic
car, race, driver, team, won, win, racing, track, season, lap
book, children, ages, author, read, newspaper, web, writer, written, sales
taliban, attack, afghanistan, official, military, u_s, united_states, terrorist, war, bin

73 Bag-of-words clustering model (one topic per line):
com, www, site, web, sites, information, online, mail, internet, telegram
court, case, law, lawyer, federal, government, decision, trial, microsoft, right
show, network, season, nbc, cb, program, television, series, night, new_york
film, movie, director, play, character, actor, show, movies, million, part
music, song, group, part, new_york, company, million, band, show, album
etc.
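
Lists like the ones above can be read off the estimated conditional means. A small sketch (mine), where mu_hat is a hypothetical d-by-k matrix of estimated topic word distributions and vocab is a list of the d vocabulary words:

```python
import numpy as np

def top_words(mu_hat, vocab, top=10):
    """For each topic j (column of mu_hat), return the `top` words with the
    largest estimated probability (mu_hat)_ij."""
    return [[vocab[i] for i in np.argsort(mu_hat[:, j])[::-1][:top]]
            for j in range(mu_hat.shape[1])]

# Usage: topics = top_words(mu_hat, vocab)
```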

74 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning)

75 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)?

76 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.?

77 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.? Can we relax the multi-view assumption?

78 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.? Can we relax the multi-view assumption? Allow for richer hidden state? (e.g., independent component analysis)

79 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.? Can we relax the multi-view assumption? Allow for richer hidden state? (e.g., independent component analysis) Gaussianization via random projection?

80 Part 4. Concluding remarks Multi-view mixture models Multi-view method-of-moments Some applications and open questions Concluding remarks

81 Concluding remarks Take-home messages:

82 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning.

83 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning. Overcoming complexity barriers: Some provably hard estimation problems become easy after ruling out degenerate cases.

84 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning. Overcoming complexity barriers: Some provably hard estimation problems become easy after ruling out degenerate cases. Blessing of dimensionality for estimators based on method-of-moments.

85 Thanks! (Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham Kakade, Yi-Kai Liu, Matus Telgarsky)
