Efficient algorithms for estimating multi-view mixture models


1 Efficient algorithms for estimating multi-view mixture models Daniel Hsu Microsoft Research, New England

2 Outline Multi-view mixture models Multi-view method-of-moments Some applications and open questions Concluding remarks

3 Part 1. Multi-view mixture models Multi-view mixture models Unsupervised learning and mixture models Multi-view mixture models Complexity barriers Multi-view method-of-moments Some applications and open questions Concluding remarks

4 Unsupervised learning Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled.

5 Unsupervised learning Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled. Unsupervised learning: extract useful info from this data. Disentangle sub-populations in data source. Discover useful representations for downstream stages of learning pipeline (e.g., supervised learning).

6 Mixture models Simple latent variable model: mixture model (h → x). h ∈ [k] := {1, 2, ..., k} (hidden); x ∈ R^d (observed); Pr[h = j] = w_j; x | h ∼ P_h; so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + ... + w_k P_k(x).

7 Mixture models Simple latent variable model: mixture model (h → x). h ∈ [k] := {1, 2, ..., k} (hidden); x ∈ R^d (observed); Pr[h = j] = w_j; x | h ∼ P_h; so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + ... + w_k P_k(x). Typical use: learn about constituent sub-populations (e.g., clusters) in the data source.
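
To make the generative process concrete, here is a minimal sketch (not part of the original slides) of sampling from such a mixture; the weights, means, and spherical-Gaussian components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture with k = 3 components in R^2 (weights/means made up).
w = np.array([0.5, 0.3, 0.2])                         # mixing weights Pr[h = j]
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])  # component means

def sample_mixture(n):
    """Draw n samples from P(x) = sum_j w_j P_j(x), with P_j = N(mu_j, I)."""
    h = rng.choice(len(w), size=n, p=w)           # hidden labels (never observed)
    x = mus[h] + rng.standard_normal((n, 2))      # observed points
    return h, x

h, x = sample_mixture(1000)
```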

8 Multi-view mixture models Can we take advantage of diverse sources of information?

9 Multi-view mixture models Can we take advantage of diverse sources of information? h ∈ [k]; graphical model h → x_1, x_2, ..., x_ℓ; x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, ..., x_ℓ ∈ R^{d_ℓ}. k = # components, ℓ = # views (e.g., audio, video, text). View 1: x_1 ∈ R^{d_1}. View 2: x_2 ∈ R^{d_2}. View 3: x_3 ∈ R^{d_3}.

10 Multi-view mixture models Can we take advantage of diverse sources of information? h ∈ [k]; graphical model h → x_1, x_2, ..., x_ℓ; x_1 ∈ R^{d_1}, x_2 ∈ R^{d_2}, ..., x_ℓ ∈ R^{d_ℓ}. k = # components, ℓ = # views (e.g., audio, video, text). View 1: x_1 ∈ R^{d_1}. View 2: x_2 ∈ R^{d_2}. View 3: x_3 ∈ R^{d_3}.

11 Multi-view mixture models Multi-view assumption: views are conditionally independent given the component. View 1: x_1 ∈ R^{d_1}. View 2: x_2 ∈ R^{d_2}. View 3: x_3 ∈ R^{d_3}. Larger k (# components): more sub-populations to disentangle. Larger ℓ (# views): more non-redundant sources of information.

12 Semi-parametric estimation task Parameters of component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means µ_{v,j} := E[x_v | h = j] ∈ R^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate mixing weights and conditional means from independent copies of (x_1, x_2, ..., x_ℓ).

13 Semi-parametric estimation task Parameters of component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means µ_{v,j} := E[x_v | h = j] ∈ R^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate mixing weights and conditional means from independent copies of (x_1, x_2, ..., x_ℓ). Questions: 1. How do we estimate {w_j} and {µ_{v,j}} without observing h? 2. How many views ℓ are sufficient to learn with poly(k) computational / sample complexity?
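
For intuition about what is being estimated, here is a small sketch (mine, not from the slides) of the plug-in targets one could compute if the hidden label h were observed; the function name and array layout are assumptions for illustration.

```python
import numpy as np

def oracle_parameters(h, views, k):
    """What the targets would be if the hidden label h were observed (it is not):
    plug-in mixing weights and per-view conditional means. `views` is a list of
    n-by-d_v arrays, `h` an integer array of length n with values in {0, ..., k-1}."""
    w = np.array([(h == j).mean() for j in range(k)])              # w_j = Pr[h = j]
    mus = [np.array([x[h == j].mean(axis=0) for j in range(k)])    # k x d_v per view
           for x in views]
    return w, mus
```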

14 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

15 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06).

16 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06). Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if components are well-separated (Moitra-Valiant, 10).

17 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06). Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if components are well-separated (Moitra-Valiant, 10). In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.

18 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, 99): sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.

19 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, 99): sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}. sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, 99; Dasgupta-Schulman, 00; Arora-Kannan, 00). sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, 02; Kannan-Salmasian-Vempala, 05; Achlioptas-McSherry, 05); also works for mixtures of log-concave distributions.

20 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if we assume a large minimum separation between component means (Dasgupta, 99): sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}. sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, 99; Dasgupta-Schulman, 00; Arora-Kannan, 00). sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, 02; Kannan-Salmasian-Vempala, 05; Achlioptas-McSherry, 05); also works for mixtures of log-concave distributions. No minimum separation requirement: method-of-moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, 10; Belkin-Sinha, 10; Moitra-Valiant, 10).
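
For reference, a tiny sketch (not from the slides) of computing this separation statistic for given component means and per-component spreads:

```python
import numpy as np

def separation(mus, sigmas):
    """sep = min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j),
    for means mus (k x d array) and per-component spreads sigmas (length k)."""
    k = len(mus)
    return min(np.linalg.norm(mus[i] - mus[j]) / max(sigmas[i], sigmas[j])
               for i in range(k) for j in range(k) if i != j)
```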

21 Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: output distributions Pr[x_t = · | h_t = 1], Pr[x_t = · | h_t = 2], Pr[x_t = · | h_t = 3].]

22 Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: output distributions Pr[x_t = · | h_t = 1], Pr[x_t = · | h_t = 2], Pr[x_t = · | h_t = 3].] These instances are avoided by assuming the parameter matrices are full-rank (Mossel-Roch, 06; Hsu-Kakade-Zhang, 09).

23 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

24 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)!

25 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)! New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).

26 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)! New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs). General tensor decomposition framework applicable to a wide variety of estimation problems.

27 Part 2. Multi-view method-of-moments Multi-view mixture models Multi-view method-of-moments Overview Structure of moments Uniqueness of decomposition Computing the decomposition Asymmetric views Some applications and open questions Concluding remarks

28 The plan First, assume views are (conditionally) exchangeable, and derive the basic algorithm.

29 The plan First, assume views are (conditionally) exchangeable, and derive the basic algorithm. Then, provide a reduction from the general multi-view setting to the exchangeable case.

30 Simpler case: exchangeable views (Conditionally) exchangeable views: assume the views have the same conditional means, i.e., E[x_v | h = j] ≡ µ_j, j ∈ [k], v ∈ [ℓ].

31 Simpler case: exchangeable views (Conditionally) exchangeable views: assume the views have the same conditional means, i.e., E[x_v | h = j] ≡ µ_j, j ∈ [k], v ∈ [ℓ]. Motivating setting: bag-of-words model; x_1, x_2, ..., x_ℓ are ℓ exchangeable words in a document. One-hot encoding: x_v = e_i ⇔ the v-th word in the document is the i-th word in the vocabulary (where e_i ∈ {0, 1}^d has a 1 in the i-th position and 0 elsewhere). (µ_j)_i = E[(x_v)_i | h = j] = Pr[x_v = e_i | h = j], i ∈ [d], j ∈ [k].
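
As an illustration (not from the slides), a short sketch of the one-hot encoding and of the fact that averaging one-hot word vectors from documents of a single topic estimates that topic's word distribution µ_j; the vocabulary size and documents are made up.

```python
import numpy as np

d = 6  # hypothetical vocabulary size

def one_hot(i, d=d):
    """e_i in {0,1}^d: 1 in position i, 0 elsewhere."""
    e = np.zeros(d)
    e[i] = 1.0
    return e

# Hypothetical documents all drawn from the same topic j: since E[x_v | h = j]
# is the topic's word distribution mu_j, the empirical mean of the one-hot
# word vectors is a plug-in estimate of mu_j.
docs = [[0, 2, 2, 5], [2, 0, 1, 2], [5, 2, 0, 0]]
words = np.array([one_hot(i) for doc in docs for i in doc])
mu_j_hat = words.mean(axis=0)
```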

32 Key ideas 1. Method-of-moments: conditional means are revealed by appropriate low-rank decompositions of moment matrices and tensors. 2. Third-order tensor decomposition is uniquely determined by directions of (locally) maximum skew. 3. The required local optimization can be efficiently performed in poly time.

33 Algebraic structure in moments Recall: E[x_v | h = j] = µ_j.

34 Algebraic structure in moments Recall: E[x_v | h = j] = µ_j. By conditional independence and exchangeability of x_1, x_2, ..., x_ℓ given h, Pairs := E[x_1 ⊗ x_2] = E[E[x_1 | h] ⊗ E[x_2 | h]] = E[µ_h ⊗ µ_h] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ∈ R^{d×d}.

35 Algebraic structure in moments Recall: E[x_v | h = j] = µ_j. By conditional independence and exchangeability of x_1, x_2, ..., x_ℓ given h, Pairs := E[x_1 ⊗ x_2] = E[E[x_1 | h] ⊗ E[x_2 | h]] = E[µ_h ⊗ µ_h] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ∈ R^{d×d}. Triples := E[x_1 ⊗ x_2 ⊗ x_3] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i ∈ R^{d×d×d}, etc. (If only we could extract these low-rank decompositions...)
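
A minimal sketch (not from the slides) of the plug-in estimates of Pairs and Triples from n i.i.d. samples of the three views, stored as rows of three n-by-d arrays:

```python
import numpy as np

def empirical_moments(x1, x2, x3):
    """Plug-in estimates of Pairs = E[x1 (x) x2] and Triples = E[x1 (x) x2 (x) x3]
    from n i.i.d. samples of the three views (rows of x1, x2, x3)."""
    n = x1.shape[0]
    pairs = x1.T @ x2 / n                                   # d x d matrix
    triples = np.einsum('ni,nj,nk->ijk', x1, x2, x3) / n    # d x d x d tensor
    return pairs, triples
```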

36 2nd moment: subspace spanned by conditional means

37 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent)

38 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k.

39 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k ⇒ Pairs equips the k-dimensional subspace span{µ_1, µ_2, ..., µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y.

40 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k ⇒ Pairs equips the k-dimensional subspace span{µ_1, µ_2, ..., µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y. However, {µ_i} are not generally determined by Pairs alone (e.g., {µ_i} are not necessarily orthogonal).

41 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ({µ_i} linearly independent) ⇒ Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd with rank k ⇒ Pairs equips the k-dimensional subspace span{µ_1, µ_2, ..., µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y. However, {µ_i} are not generally determined by Pairs alone (e.g., {µ_i} are not necessarily orthogonal). Must look at higher-order moments?
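
A small numerical illustration (mine, not from the slides) of this point: two different sets of conditional means, with the same mixing weights, can produce exactly the same Pairs matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3
w = np.array([0.2, 0.3, 0.5])
M = rng.standard_normal((d, k))                    # columns are mu_1, ..., mu_k

# Rotate in "whitened" coordinates: M2 = M diag(sqrt(w)) Q diag(1/sqrt(w)).
Q, _ = np.linalg.qr(rng.standard_normal((k, k)))   # a random orthogonal matrix
M2 = M @ np.diag(np.sqrt(w)) @ Q @ np.diag(1.0 / np.sqrt(w))

pairs1 = M @ np.diag(w) @ M.T
pairs2 = M2 @ np.diag(w) @ M2.T
print(np.allclose(pairs1, pairs2))                 # True: same Pairs, different means
```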

42 3rd moment: (cross) skew maximizers Claim: up to the third moment (i.e., 3 views) suffices. View Triples: R^d × R^d × R^d → R as a trilinear form.

43 3rd moment: (cross) skew maximizers Claim: up to the third moment (i.e., 3 views) suffices. View Triples: R^d × R^d × R^d → R as a trilinear form. Theorem. Each isolated local maximizer η of max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1 satisfies, for some i ∈ [k], Pairs η = √w_i µ_i and Triples(η, η, η) = 1/√w_i.

44 3rd moment: (cross) skew maximizers Claim: up to the third moment (i.e., 3 views) suffices. View Triples: R^d × R^d × R^d → R as a trilinear form. Theorem. Each isolated local maximizer η of max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1 satisfies, for some i ∈ [k], Pairs η = √w_i µ_i and Triples(η, η, η) = 1/√w_i. Also: these maximizers can be found efficiently and robustly.

45 Variational analysis max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1

46 Variational analysis max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1 (Substitute Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i and Triples = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i.)

47 Variational analysis max_{η ∈ R^d} ∑_{i=1}^k w_i (η^⊤µ_i)^3 s.t. ∑_{i=1}^k w_i (η^⊤µ_i)^2 ≤ 1 (Substitute Pairs = ∑_{i=1}^k w_i µ_i ⊗ µ_i and Triples = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i.)

48 Variational analysis max_{η ∈ R^d} ∑_{i=1}^k w_i (η^⊤µ_i)^3 s.t. ∑_{i=1}^k w_i (η^⊤µ_i)^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].)

49 Variational analysis max_{η ∈ R^d} ∑_{i=1}^k (1/√w_i)(√w_i η^⊤µ_i)^3 s.t. ∑_{i=1}^k (√w_i η^⊤µ_i)^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].)

50 Variational analysis max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i^3 s.t. ∑_{i=1}^k θ_i^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].)

51 Variational analysis max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i^3 s.t. ∑_{i=1}^k θ_i^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].) Isolated local maximizers θ (found via gradient ascent) are e_1 = (1, 0, 0, ...), e_2 = (0, 1, 0, ...), etc., which means that each corresponding η satisfies, for some i ∈ [k], √w_j (η^⊤µ_j) = 1 if j = i, and 0 if j ≠ i.

52 Variational analysis max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i^3 s.t. ∑_{i=1}^k θ_i^2 ≤ 1 (Let θ_i := √w_i (η^⊤µ_i) for i ∈ [k].) Isolated local maximizers θ (found via gradient ascent) are e_1 = (1, 0, 0, ...), e_2 = (0, 1, 0, ...), etc., which means that each corresponding η satisfies, for some i ∈ [k], √w_j (η^⊤µ_j) = 1 if j = i, and 0 if j ≠ i. Therefore Pairs η = ∑_{j=1}^k w_j µ_j (µ_j^⊤ η) = √w_i µ_i.

53 Extracting all isolated local maximizers 1. Start with T := Triples.

54 Extracting all isolated local maximizers 1. Start with T := Triples. 2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ and the maximizer is η.

55 Extracting all isolated local maximizers 1. Start with T := Triples. 2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ and the maximizer is η. 3. Deflation: replace T with T − λ η ⊗ η ⊗ η. Go to step 2.

56 Extracting all isolated local maximizers 1. Start with T := Triples. 2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ and the maximizer is η. 3. Deflation: replace T with T − λ η ⊗ η ⊗ η. Go to step 2. A variant of this runs in polynomial time (w.h.p.), and is robust to perturbations to Pairs and Triples.
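
For concreteness, here is a simplified sketch of this extraction step. It is not the exact procedure from the slides: instead of projected gradient ascent under the Pairs constraint, it whitens Triples with Pairs and runs tensor power iteration with deflation, which recovers the same maximizers when the moments are exact; the restart and iteration counts are arbitrary choices.

```python
import numpy as np

def learn_mixture(pairs, triples, k, n_restarts=10, n_iters=100, rng=None):
    """Simplified sketch of the moment-based extraction: whiten Triples with
    Pairs, then find skew-maximizing directions by tensor power iteration with
    deflation. Assumes exact (noiseless) moments; a stand-in for the
    constrained gradient ascent described on the slides."""
    rng = np.random.default_rng() if rng is None else rng
    # Whitening matrix W with W.T @ pairs @ W = I_k (top-k eigenspace of Pairs).
    evals, evecs = np.linalg.eigh(pairs)
    top = np.argsort(evals)[-k:]
    W = evecs[:, top] / np.sqrt(evals[top])                   # d x k
    T = np.einsum('ijk,ia,jb,kc->abc', triples, W, W, W)      # whitened k x k x k

    weights, means = [], []
    for _ in range(k):
        best_val, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.standard_normal(k)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):                    # tensor power iteration
                v = np.einsum('ijk,j,k->i', T, v, v)
                v /= np.linalg.norm(v)
            val = np.einsum('ijk,i,j,k->', T, v, v, v)
            if val > best_val:
                best_val, best_v = val, v
        lam, v = best_val, best_v                       # lam = 1 / sqrt(w_i)
        weights.append(1.0 / lam ** 2)
        means.append(lam * np.linalg.pinv(W.T) @ v)     # un-whiten to recover mu_i
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflation
    return np.array(weights), np.array(means)

# Example on exact moments built from known w (length k) and mus (k x d):
#   pairs = sum(w[i] * np.outer(mus[i], mus[i]) for i in range(k))
#   triples = sum(w[i] * np.einsum('a,b,c->abc', mus[i], mus[i], mus[i]) for i in range(k))
#   w_hat, mus_hat = learn_mixture(pairs, triples, k)
```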

57 General case: asymmetric views Each view v has a different set of conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} ⊂ R^{d_v}.

58 General case: asymmetric views Each view v has a different set of conditional means {µ_{v,1}, µ_{v,2}, ..., µ_{v,k}} ⊂ R^{d_v}. Reduction: transform x_1 and x_2 to look like x_3 via linear transformations.

59 Asymmetric cross moments Define the asymmetric cross moment Pairs_{u,v} := E[x_u ⊗ x_v].

60 Asymmetric cross moments Define the asymmetric cross moment Pairs_{u,v} := E[x_u ⊗ x_v]. Transforming view v to view 3: C_{v→3} := E[x_3 ⊗ x_u] E[x_v ⊗ x_u]^† ∈ R^{d_3 × d_v}, where † denotes the Moore-Penrose pseudoinverse.

61 Asymmetric cross moments Define the asymmetric cross moment Pairs_{u,v} := E[x_u ⊗ x_v]. Transforming view v to view 3: C_{v→3} := E[x_3 ⊗ x_u] E[x_v ⊗ x_u]^† ∈ R^{d_3 × d_v}, where † denotes the Moore-Penrose pseudoinverse. Simple exercise to show E[C_{v→3} x_v | h = j] = µ_{3,j}, so C_{v→3} x_v behaves like x_3 (as far as our algorithm can tell).
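
A small sketch (mine) of this reduction using plug-in cross moments; x3, xu, xv are sample matrices for view 3, the auxiliary view u, and the view v being transformed.

```python
import numpy as np

def transform_view(x3, xu, xv):
    """Plug-in version of C_{v->3} = E[x3 (x) xu] E[xv (x) xu]^+ from samples
    (rows of x3, xu, xv); returns the transformed copies C_{v->3} x_v, which
    have the same conditional means as view 3."""
    n = x3.shape[0]
    pairs_3u = x3.T @ xu / n
    pairs_vu = xv.T @ xu / n
    C = pairs_3u @ np.linalg.pinv(pairs_vu)     # d_3 x d_v
    return xv @ C.T                             # n x d_3
```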

62 Part 3. Some applications and open questions Multi-view mixture models Multi-view method-of-moments Some applications and open questions Mixtures of Gaussians Hidden Markov models and other models Topic models Open questions Concluding remarks

63 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, ..., µ_k ∈ R^n; no minimum separation requirement. [Figure: graphical model h → x_1, x_2, ..., x_n.]

64 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, ..., µ_k ∈ R^n; no minimum separation requirement. [Figure: graphical model h → x_1, x_2, ..., x_n.] Assumptions: non-degeneracy: component means span a k-dimensional subspace; weak incoherence condition: component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, 08).

65 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, ..., µ_k ∈ R^n; no minimum separation requirement. [Figure: graphical model h → x_1, x_2, ..., x_n.] Assumptions: non-degeneracy: component means span a k-dimensional subspace; weak incoherence condition: component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, 08). Then randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.
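
A minimal sketch (not from the slides) of this coordinate-splitting step for a sample matrix X with one coordinate per column:

```python
import numpy as np

def random_views(X, n_views=3, rng=None):
    """Randomly partition the n coordinates of axis-aligned Gaussian data X
    (samples in rows) into n_views disjoint blocks, giving conditionally
    independent views x_1, ..., x_l."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(X.shape[1])
    blocks = np.array_split(perm, n_views)
    return [X[:, b] for b in blocks]
```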

66 Hidden Markov models and others [Figure: HMM graphical model with hidden chain h_1 → h_2 → h_3 and observations x_1, x_2, x_3.]

67 Hidden Markov models and others [Figure: HMM graphical model h_1 → h_2 → h_3 with observations x_1, x_2, x_3, alongside the multi-view mixture model h → x_1, x_2, x_3.]

68 Hidden Markov models and others [Figure: HMM graphical model h_1 → h_2 → h_3 with observations x_1, x_2, x_3, alongside the multi-view mixture model h → x_1, x_2, x_3.] Other models: 1. Mixtures of Gaussians (Hsu-Kakade, ITCS 13) 2. HMMs (Anandkumar-Hsu-Kakade, COLT 12) 3. Latent Dirichlet Allocation (Anandkumar-Foster-Hsu-Kakade-Liu, NIPS 12) 4. Latent parse trees (Hsu-Kakade-Liang, NIPS 12) 5. Independent Component Analysis (Arora-Ge-Moitra-Sachdeva, NIPS 12; Hsu-Kakade, ITCS 13)

69 Bag-of-words clustering model (µ_j)_i = Pr[see word i in document | document topic is j]. Corpus: New York Times articles (from UCI). Vocabulary size d. Chose k = 50. For each topic j, show the top 10 words i.

70 Bag-of-words clustering model (µ_j)_i = Pr[see word i in document | document topic is j]. Corpus: New York Times articles (from UCI). Vocabulary size d. Chose k = 50. For each topic j, the top 10 words i (one topic per line):
sales, economic, consumer, major, home, indicator, weekly, order, claim, scheduled
run, inning, hit, game, season, home, right, games, dodger, left
school, student, teacher, program, official, public, children, high, education, district
drug, patient, million, company, doctor, companies, percent, cost, program, health
player, tiger_wood, won, shot, play, round, win, tournament, tour, right

71 Bag-of-words clustering model (one topic per line):
palestinian, israel, israeli, yasser_arafat, peace, israeli, israelis, leader, official, attack
tax, cut, percent, bush, billion, plan, bill, taxes, million, congress
cup, minutes, oil, water, add, tablespoon, food, teaspoon, pepper, sugar
point, game, team, shot, play, laker, season, half, lead, games
yard, game, play, season, team, touchdown, quarterback, coach, defense, quarter

72 Bag-of-words clustering model (one topic per line):
percent, stock, market, fund, investor, companies, analyst, money, investment, economy
al_gore, campaign, president, george_bush, bush, clinton, vice, presidential, million, democratic
car, race, driver, team, won, win, racing, track, season, lap
book, children, ages, author, read, newspaper, web, writer, written, sales
taliban, attack, afghanistan, official, military, u_s, united_states, terrorist, war, bin

73 Bag-of-words clustering model (one topic per line):
com, www, site, web, sites, information, online, mail, internet, telegram
court, case, law, lawyer, federal, government, decision, trial, microsoft, right
show, network, season, nbc, cb, program, television, series, night, new_york
film, movie, director, play, character, actor, show, movies, million, part
music, song, group, part, new_york, company, million, band, show, album
etc.
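
Lists like the ones above can be read off the estimated conditional means. A small sketch (mine), where mu_hat is a hypothetical d-by-k matrix of estimated topic word distributions and vocab is a list of the d vocabulary words:

```python
import numpy as np

def top_words(mu_hat, vocab, top=10):
    """For each topic j (column of mu_hat), return the `top` words with the
    largest estimated probability (mu_hat)_ij."""
    return [[vocab[i] for i in np.argsort(mu_hat[:, j])[::-1][:top]]
            for j in range(mu_hat.shape[1])]

# Usage: topics = top_words(mu_hat, vocab)
```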

74 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning)

75 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)?

76 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.?

77 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.? Can we relax the multi-view assumption?

78 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.? Can we relax the multi-view assumption? Allow for richer hidden state? (e.g., independent component analysis)

79 Some open questions What if k > d_v? (relevant to overcomplete dictionary learning) Apply some non-linear transformations x_v ↦ f_v(x_v)? Combine views, e.g., via tensor products x_{1,2} := x_1 ⊗ x_2, x_{3,4} := x_3 ⊗ x_4, x_{5,6} := x_5 ⊗ x_6, etc.? Can we relax the multi-view assumption? Allow for richer hidden state? (e.g., independent component analysis) Gaussianization via random projection?

80 Part 4. Concluding remarks Multi-view mixture models Multi-view method-of-moments Some applications and open questions Concluding remarks

81 Concluding remarks Take-home messages:

82 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning.

83 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning. Overcoming complexity barriers: Some provably hard estimation problems become easy after ruling out degenerate cases.

84 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning. Overcoming complexity barriers: Some provably hard estimation problems become easy after ruling out degenerate cases. Blessing of dimensionality for estimators based on method-of-moments.

85 Thanks! (Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham Kakade, Yi-Kai Liu, Matus Telgarsky)
