Efficient algorithms for estimating multi-view mixture models
1 Efficient algorithms for estimating multi-view mixture models Daniel Hsu Microsoft Research, New England
2 Outline Multi-view mixture models Multi-view method-of-moments Some applications and open questions Concluding remarks
3 Part 1. Multi-view mixture models Multi-view mixture models Unsupervised learning and mixture models Multi-view mixture models Complexity barriers Multi-view method-of-moments Some applications and open questions Concluding remarks
4 Unsupervised learning Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled.
5 Unsupervised learning Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled. Unsupervised learning: extract useful info from this data. Disentangle sub-populations in data source. Discover useful representations for downstream stages of learning pipeline (e.g., supervised learning).
6 Mixture models Simple latent variable model: mixture model. [Graphical model: $h \to \vec{x}$.] $h \in [k] := \{1, 2, \dots, k\}$ (hidden); $\vec{x} \in \mathbb{R}^d$ (observed); $\Pr[h = j] = w_j$; $\vec{x} \mid h \sim P_h$; so $\vec{x}$ has a mixture distribution $P(\vec{x}) = w_1 P_1(\vec{x}) + w_2 P_2(\vec{x}) + \cdots + w_k P_k(\vec{x})$.
7 Mixture models Simple latent variable model: mixture model. [Graphical model: $h \to \vec{x}$.] $h \in [k] := \{1, 2, \dots, k\}$ (hidden); $\vec{x} \in \mathbb{R}^d$ (observed); $\Pr[h = j] = w_j$; $\vec{x} \mid h \sim P_h$; so $\vec{x}$ has a mixture distribution $P(\vec{x}) = w_1 P_1(\vec{x}) + w_2 P_2(\vec{x}) + \cdots + w_k P_k(\vec{x})$. Typical use: learn about constituent sub-populations (e.g., clusters) in the data source.
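To make the generative process concrete, here is a minimal sampling sketch for such a mixture. The spherical-Gaussian components and all parameter values are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture of k = 3 spherical Gaussian components in R^d.
k, d = 3, 10
w = np.array([0.5, 0.3, 0.2])        # mixing weights w_j = Pr[h = j]
mu = rng.normal(size=(k, d))         # component means mu_j

def sample(n):
    """Draw n i.i.d. samples from the mixture P = sum_j w_j P_j."""
    h = rng.choice(k, size=n, p=w)        # hidden labels (never observed by the learner)
    x = mu[h] + rng.normal(size=(n, d))   # x | h = j  ~  N(mu_j, I)
    return h, x

h, x = sample(1000)
```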
8 Multi-view mixture models Can we take advantage of diverse sources of information?
9 Multi-view mixture models Can we take advantage of diverse sources of information? [Graphical model: $h \to \vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell$.] $h \in [k]$, $\vec{x}_1 \in \mathbb{R}^{d_1}$, $\vec{x}_2 \in \mathbb{R}^{d_2}$, ..., $\vec{x}_\ell \in \mathbb{R}^{d_\ell}$. $k$ = # components, $\ell$ = # views (e.g., audio, video, text). View 1: $\vec{x}_1 \in \mathbb{R}^{d_1}$. View 2: $\vec{x}_2 \in \mathbb{R}^{d_2}$. View 3: $\vec{x}_3 \in \mathbb{R}^{d_3}$.
10 Multi-view mixture models Can we take advantage of diverse sources of information? [Graphical model: $h \to \vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell$.] $h \in [k]$, $\vec{x}_1 \in \mathbb{R}^{d_1}$, $\vec{x}_2 \in \mathbb{R}^{d_2}$, ..., $\vec{x}_\ell \in \mathbb{R}^{d_\ell}$. $k$ = # components, $\ell$ = # views (e.g., audio, video, text). View 1: $\vec{x}_1 \in \mathbb{R}^{d_1}$. View 2: $\vec{x}_2 \in \mathbb{R}^{d_2}$. View 3: $\vec{x}_3 \in \mathbb{R}^{d_3}$.
11 Multi-view mixture models Multi-view assumption: Views are conditionally independent given the component. View 1: $\vec{x}_1 \in \mathbb{R}^{d_1}$. View 2: $\vec{x}_2 \in \mathbb{R}^{d_2}$. View 3: $\vec{x}_3 \in \mathbb{R}^{d_3}$. Larger $k$ (# components): more sub-populations to disentangle. Larger $\ell$ (# views): more non-redundant sources of information.
12 Semi-parametric estimation task Parameters of component distributions: mixing weights $w_j := \Pr[h = j]$, $j \in [k]$; conditional means $\vec{\mu}_{v,j} := \mathbb{E}[\vec{x}_v \mid h = j] \in \mathbb{R}^{d_v}$, $j \in [k]$, $v \in [\ell]$. Goal: estimate mixing weights and conditional means from independent copies of $(\vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell)$.
13 Semi-parametric estimation task Parameters of component distributions: mixing weights $w_j := \Pr[h = j]$, $j \in [k]$; conditional means $\vec{\mu}_{v,j} := \mathbb{E}[\vec{x}_v \mid h = j] \in \mathbb{R}^{d_v}$, $j \in [k]$, $v \in [\ell]$. Goal: estimate mixing weights and conditional means from independent copies of $(\vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell)$. Questions: 1. How do we estimate $\{w_j\}$ and $\{\vec{\mu}_{v,j}\}$ without observing $h$? 2. How many views $\ell$ are sufficient to learn with $\mathrm{poly}(k)$ computational / sample complexity?
14 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem.
15 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06).
16 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06). Statistical barrier: Gaussian mixtures in $\mathbb{R}^1$ can require $\exp(\Omega(k))$ samples to estimate parameters, even if components are well-separated (Moitra-Valiant, 10).
17 Some barriers to efficient estimation Challenge: many difficult parametric estimation tasks reduce to this estimation problem. Cryptographic barrier: discrete HMM parameter estimation as hard as learning parity functions with noise (Mossel-Roch, 06). Statistical barrier: Gaussian mixtures in $\mathbb{R}^1$ can require $\exp(\Omega(k))$ samples to estimate parameters, even if components are well-separated (Moitra-Valiant, 10). In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.
18 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if one assumes a large minimum separation between component means (Dasgupta, 99): $\mathrm{sep} := \min_{i \neq j} \frac{\|\vec{\mu}_i - \vec{\mu}_j\|}{\max\{\sigma_i, \sigma_j\}}$.
19 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if one assumes a large minimum separation between component means (Dasgupta, 99): $\mathrm{sep} := \min_{i \neq j} \frac{\|\vec{\mu}_i - \vec{\mu}_j\|}{\max\{\sigma_i, \sigma_j\}}$. $\mathrm{sep} = \Omega(d^c)$: interpoint distance-based methods / EM (Dasgupta, 99; Dasgupta-Schulman, 00; Arora-Kannan, 00). $\mathrm{sep} = \Omega(k^c)$: first use PCA to project to $k$ dimensions (Vempala-Wang, 02; Kannan-Salmasian-Vempala, 05; Achlioptas-McSherry, 05); also works for mixtures of log-concave distributions.
20 Making progress: Gaussian mixture model Gaussian mixture model: the problem becomes easier if one assumes a large minimum separation between component means (Dasgupta, 99): $\mathrm{sep} := \min_{i \neq j} \frac{\|\vec{\mu}_i - \vec{\mu}_j\|}{\max\{\sigma_i, \sigma_j\}}$. $\mathrm{sep} = \Omega(d^c)$: interpoint distance-based methods / EM (Dasgupta, 99; Dasgupta-Schulman, 00; Arora-Kannan, 00). $\mathrm{sep} = \Omega(k^c)$: first use PCA to project to $k$ dimensions (Vempala-Wang, 02; Kannan-Salmasian-Vempala, 05; Achlioptas-McSherry, 05); also works for mixtures of log-concave distributions. No minimum separation requirement: method-of-moments, but $\exp(\Omega(k))$ running time / sample size (Kalai-Moitra-Valiant, 10; Belkin-Sinha, 10; Moitra-Valiant, 10).
21 Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: output distributions $\Pr[\vec{x}_t = \cdot \mid h_t = 1]$, $\Pr[\vec{x}_t = \cdot \mid h_t = 2]$, $\Pr[\vec{x}_t = \cdot \mid h_t = 3]$.]
22 Making progress: discrete hidden Markov models Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: output distributions $\Pr[\vec{x}_t = \cdot \mid h_t = 1]$, $\Pr[\vec{x}_t = \cdot \mid h_t = 2]$, $\Pr[\vec{x}_t = \cdot \mid h_t = 3]$.] These instances are avoided by assuming parameter matrices are full-rank (Mossel-Roch, 06; Hsu-Kakade-Zhang, 09).
23 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.
24 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: conditional means $\{\vec{\mu}_{v,1}, \vec{\mu}_{v,2}, \dots, \vec{\mu}_{v,k}\}$ are linearly independent for each view $v \in [\ell]$, and $\vec{w} > 0$ entrywise. Requires high-dimensional observations ($d_v \geq k$)!
25 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: conditional means $\{\vec{\mu}_{v,1}, \vec{\mu}_{v,2}, \dots, \vec{\mu}_{v,k}\}$ are linearly independent for each view $v \in [\ell]$, and $\vec{w} > 0$ entrywise. Requires high-dimensional observations ($d_v \geq k$)! New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).
26 What we do This work: given 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation. Non-degeneracy condition for the multi-view mixture model: conditional means $\{\vec{\mu}_{v,1}, \vec{\mu}_{v,2}, \dots, \vec{\mu}_{v,k}\}$ are linearly independent for each view $v \in [\ell]$, and $\vec{w} > 0$ entrywise. Requires high-dimensional observations ($d_v \geq k$)! New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs). General tensor decomposition framework applicable to a wide variety of estimation problems.
27 Part 2. Multi-view method-of-moments Multi-view mixture models Multi-view method-of-moments Overview Structure of moments Uniqueness of decomposition Computing the decomposition Asymmetric views Some applications and open questions Concluding remarks
28 The plan First, assume views are (conditionally) exchangeable, and derive basic algorithm.
29 The plan First, assume views are (conditionally) exchangeable, and derive basic algorithm. Then, provide reduction from general multi-view setting to exchangeable case.
30 Simpler case: exchangeable views (Conditionally) exchangeable views: assume the views have the same conditional means, i.e., $\mathbb{E}[\vec{x}_v \mid h = j] \equiv \vec{\mu}_j$, $j \in [k]$, $v \in [\ell]$.
31 Simpler case: exchangeable views (Conditionally) exchangeable views: assume the views have the same conditional means, i.e., $\mathbb{E}[\vec{x}_v \mid h = j] \equiv \vec{\mu}_j$, $j \in [k]$, $v \in [\ell]$. Motivating setting: bag-of-words model, with $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell$ being $\ell$ exchangeable words in a document. One-hot encoding: $\vec{x}_v = \vec{e}_i$ iff the $v$-th word in the document is the $i$-th word in the vocabulary (where $\vec{e}_i \in \{0,1\}^d$ has a 1 in the $i$-th position and 0 elsewhere). $(\vec{\mu}_j)_i = \mathbb{E}[(\vec{x}_v)_i \mid h = j] = \Pr[\vec{x}_v = \vec{e}_i \mid h = j]$, $i \in [d]$, $j \in [k]$.
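As a concrete illustration of the one-hot encoding, a toy sketch (the vocabulary size and word indices below are made up):

```python
import numpy as np

d = 6                    # toy vocabulary size
doc = [2, 0, 2]          # indices of the l = 3 exchangeable words in one document

# One-hot encoding: row v is e_i when the v-th word is the i-th vocabulary word.
X = np.eye(d)[doc]       # shape (l, d); each row is a standard basis vector

# Averaging such rows over many documents from topic j estimates mu_j,
# whose i-th entry is Pr[word = i | topic = j].
```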
32 Key ideas 1. Method-of-moments: conditional means are revealed by appropriate low-rank decompositions of moment matrices and tensors. 2. Third-order tensor decomposition is uniquely determined by directions of (locally) maximum skew. 3. The required local optimization can be efficiently performed in poly time.
33 Algebraic structure in moments Recall: $\mathbb{E}[\vec{x}_v \mid h = j] = \vec{\mu}_j$.
34 Algebraic structure in moments Recall: $\mathbb{E}[\vec{x}_v \mid h = j] = \vec{\mu}_j$. By conditional independence and exchangeability of $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell$ given $h$, $\mathrm{Pairs} := \mathbb{E}[\vec{x}_1 \otimes \vec{x}_2] = \mathbb{E}\big[\mathbb{E}[\vec{x}_1 \mid h] \otimes \mathbb{E}[\vec{x}_2 \mid h]\big] = \mathbb{E}[\vec{\mu}_h \otimes \vec{\mu}_h] = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i \in \mathbb{R}^{d \times d}$.
35 Algebraic structure in moments Recall: $\mathbb{E}[\vec{x}_v \mid h = j] = \vec{\mu}_j$. By conditional independence and exchangeability of $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_\ell$ given $h$, $\mathrm{Pairs} := \mathbb{E}[\vec{x}_1 \otimes \vec{x}_2] = \mathbb{E}\big[\mathbb{E}[\vec{x}_1 \mid h] \otimes \mathbb{E}[\vec{x}_2 \mid h]\big] = \mathbb{E}[\vec{\mu}_h \otimes \vec{\mu}_h] = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i \in \mathbb{R}^{d \times d}$. $\mathrm{Triples} := \mathbb{E}[\vec{x}_1 \otimes \vec{x}_2 \otimes \vec{x}_3] = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i \otimes \vec{\mu}_i \in \mathbb{R}^{d \times d \times d}$, etc. (If only we could extract these low-rank decompositions... )
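In practice these population moments are replaced by plug-in estimates from samples. A minimal sketch, assuming the three views are stored as n-by-d arrays `X1`, `X2`, `X3` (placeholder names, one row per observation):

```python
import numpy as np

def empirical_moments(X1, X2, X3):
    """Plug-in estimates of Pairs = E[x1 (x) x2] and Triples = E[x1 (x) x2 (x) x3]
    from n i.i.d. observations of the (exchangeable) views."""
    n = X1.shape[0]
    pairs = X1.T @ X2 / n                                   # d x d matrix
    triples = np.einsum('ni,nj,nk->ijk', X1, X2, X3) / n    # d x d x d tensor
    return pairs, triples
```

With exchangeable views, one can also average over the choice of which view plays which role to reduce estimation variance.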
36 2nd moment: subspace spanned by conditional means
37 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ($\{\vec{\mu}_i\}$ linearly independent).
38 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ($\{\vec{\mu}_i\}$ linearly independent) $\implies$ $\mathrm{Pairs} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i$ is symmetric, psd, and rank $k$.
39 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ($\{\vec{\mu}_i\}$ linearly independent) $\implies$ $\mathrm{Pairs} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i$ is symmetric, psd, and rank $k$ $\implies$ $\mathrm{Pairs}$ equips the $k$-dim subspace $\mathrm{span}\{\vec{\mu}_1, \vec{\mu}_2, \dots, \vec{\mu}_k\}$ with the inner product $\mathrm{Pairs}(\vec{x}, \vec{y}) := \vec{x}^\top \mathrm{Pairs}\, \vec{y}$.
40 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ($\{\vec{\mu}_i\}$ linearly independent) $\implies$ $\mathrm{Pairs} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i$ is symmetric, psd, and rank $k$ $\implies$ $\mathrm{Pairs}$ equips the $k$-dim subspace $\mathrm{span}\{\vec{\mu}_1, \vec{\mu}_2, \dots, \vec{\mu}_k\}$ with the inner product $\mathrm{Pairs}(\vec{x}, \vec{y}) := \vec{x}^\top \mathrm{Pairs}\, \vec{y}$. However, $\{\vec{\mu}_i\}$ are not generally determined by $\mathrm{Pairs}$ alone (e.g., the $\vec{\mu}_i$ are not necessarily orthogonal).
41 2nd moment: subspace spanned by conditional means Non-degeneracy assumption ($\{\vec{\mu}_i\}$ linearly independent) $\implies$ $\mathrm{Pairs} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i$ is symmetric, psd, and rank $k$ $\implies$ $\mathrm{Pairs}$ equips the $k$-dim subspace $\mathrm{span}\{\vec{\mu}_1, \vec{\mu}_2, \dots, \vec{\mu}_k\}$ with the inner product $\mathrm{Pairs}(\vec{x}, \vec{y}) := \vec{x}^\top \mathrm{Pairs}\, \vec{y}$. However, $\{\vec{\mu}_i\}$ are not generally determined by $\mathrm{Pairs}$ alone (e.g., the $\vec{\mu}_i$ are not necessarily orthogonal). Must look at higher-order moments?
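A small sketch of how Pairs exposes the span of the conditional means and the inner product used below. This is illustrative only; `pairs` is assumed to be the d-by-d moment matrix from the earlier snippet:

```python
import numpy as np

def mean_subspace(pairs, k):
    """Orthonormal basis (d x k) for span{mu_1, ..., mu_k}: the top-k
    eigenvectors of the symmetric psd, rank-k matrix Pairs."""
    evals, evecs = np.linalg.eigh(pairs)
    return evecs[:, np.argsort(evals)[::-1][:k]]

def pairs_inner(pairs, x, y):
    """The inner product Pairs(x, y) := x^T Pairs y appearing in the constraint below."""
    return x @ pairs @ y
```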
42 3rd moment: (cross) skew maximizers Claim: up to third-order moments (i.e., 3 views) suffice. View $\mathrm{Triples} \colon \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as a trilinear form.
43 3rd moment: (cross) skew maximizers Claim: up to third-order moments (i.e., 3 views) suffice. View $\mathrm{Triples} \colon \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as a trilinear form. Theorem: Each isolated local maximizer $\vec{\eta}^*$ of $\max_{\vec{\eta} \in \mathbb{R}^d} \mathrm{Triples}(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$ satisfies, for some $i \in [k]$, $\mathrm{Pairs}\, \vec{\eta}^* = \sqrt{w_i}\, \vec{\mu}_i$ and $\mathrm{Triples}(\vec{\eta}^*, \vec{\eta}^*, \vec{\eta}^*) = 1/\sqrt{w_i}$.
44 3rd moment: (cross) skew maximizers Claim: up to third-order moments (i.e., 3 views) suffice. View $\mathrm{Triples} \colon \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ as a trilinear form. Theorem: Each isolated local maximizer $\vec{\eta}^*$ of $\max_{\vec{\eta} \in \mathbb{R}^d} \mathrm{Triples}(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$ satisfies, for some $i \in [k]$, $\mathrm{Pairs}\, \vec{\eta}^* = \sqrt{w_i}\, \vec{\mu}_i$ and $\mathrm{Triples}(\vec{\eta}^*, \vec{\eta}^*, \vec{\eta}^*) = 1/\sqrt{w_i}$. Also: these maximizers can be found efficiently and robustly.
45 Variational analysis $\max_{\vec{\eta} \in \mathbb{R}^d} \mathrm{Triples}(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$.
46 Variational analysis $\max_{\vec{\eta} \in \mathbb{R}^d} \mathrm{Triples}(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$. (Substitute $\mathrm{Pairs} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i$ and $\mathrm{Triples} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i \otimes \vec{\mu}_i$.)
47 Variational analysis $\max_{\vec{\eta} \in \mathbb{R}^d} \sum_{i=1}^{k} w_i (\vec{\eta} \cdot \vec{\mu}_i)^3$ s.t. $\sum_{i=1}^{k} w_i (\vec{\eta} \cdot \vec{\mu}_i)^2 \leq 1$. (Substitute $\mathrm{Pairs} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i$ and $\mathrm{Triples} = \sum_{i=1}^{k} w_i\, \vec{\mu}_i \otimes \vec{\mu}_i \otimes \vec{\mu}_i$.)
48 Variational analysis $\max_{\vec{\eta} \in \mathbb{R}^d} \sum_{i=1}^{k} w_i (\vec{\eta} \cdot \vec{\mu}_i)^3$ s.t. $\sum_{i=1}^{k} w_i (\vec{\eta} \cdot \vec{\mu}_i)^2 \leq 1$. (Let $\theta_i := \sqrt{w_i}\, (\vec{\eta} \cdot \vec{\mu}_i)$ for $i \in [k]$.)
49 Variational analysis $\max_{\vec{\eta} \in \mathbb{R}^d} \sum_{i=1}^{k} \frac{1}{\sqrt{w_i}} (\sqrt{w_i}\, \vec{\eta} \cdot \vec{\mu}_i)^3$ s.t. $\sum_{i=1}^{k} (\sqrt{w_i}\, \vec{\eta} \cdot \vec{\mu}_i)^2 \leq 1$. (Let $\theta_i := \sqrt{w_i}\, (\vec{\eta} \cdot \vec{\mu}_i)$ for $i \in [k]$.)
50 Variational analysis $\max_{\vec{\theta} \in \mathbb{R}^k} \sum_{i=1}^{k} \frac{1}{\sqrt{w_i}} \theta_i^3$ s.t. $\sum_{i=1}^{k} \theta_i^2 \leq 1$. (Let $\theta_i := \sqrt{w_i}\, (\vec{\eta} \cdot \vec{\mu}_i)$ for $i \in [k]$.)
51 Variational analysis $\max_{\vec{\theta} \in \mathbb{R}^k} \sum_{i=1}^{k} \frac{1}{\sqrt{w_i}} \theta_i^3$ s.t. $\sum_{i=1}^{k} \theta_i^2 \leq 1$. (Let $\theta_i := \sqrt{w_i}\, (\vec{\eta} \cdot \vec{\mu}_i)$ for $i \in [k]$.) Isolated local maximizers $\vec{\theta}^*$ (found via gradient ascent) are $\vec{e}_1 = (1, 0, 0, \dots)$, $\vec{e}_2 = (0, 1, 0, \dots)$, etc., which means that each $\vec{\eta}^*$ satisfies, for some $i \in [k]$, $\sqrt{w_j}\, (\vec{\eta}^* \cdot \vec{\mu}_j) = \begin{cases} 1 & j = i \\ 0 & j \neq i. \end{cases}$
52 Variational analysis $\max_{\vec{\theta} \in \mathbb{R}^k} \sum_{i=1}^{k} \frac{1}{\sqrt{w_i}} \theta_i^3$ s.t. $\sum_{i=1}^{k} \theta_i^2 \leq 1$. (Let $\theta_i := \sqrt{w_i}\, (\vec{\eta} \cdot \vec{\mu}_i)$ for $i \in [k]$.) Isolated local maximizers $\vec{\theta}^*$ (found via gradient ascent) are $\vec{e}_1 = (1, 0, 0, \dots)$, $\vec{e}_2 = (0, 1, 0, \dots)$, etc., which means that each $\vec{\eta}^*$ satisfies, for some $i \in [k]$, $\sqrt{w_j}\, (\vec{\eta}^* \cdot \vec{\mu}_j) = \begin{cases} 1 & j = i \\ 0 & j \neq i. \end{cases}$ Therefore $\mathrm{Pairs}\, \vec{\eta}^* = \sum_{j=1}^{k} w_j\, \vec{\mu}_j (\vec{\mu}_j \cdot \vec{\eta}^*) = \sqrt{w_i}\, \vec{\mu}_i$.
53 Extracting all isolated local maximizers 1. Start with $T := \mathrm{Triples}$.
54 Extracting all isolated local maximizers 1. Start with $T := \mathrm{Triples}$. 2. Find an isolated local maximizer of $T(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$ via gradient ascent from a random $\vec{\eta} \in \mathrm{range}(\mathrm{Pairs})$. Say the maximum is $\lambda$ and the maximizer is $\vec{\eta}$.
55 Extracting all isolated local maximizers 1. Start with $T := \mathrm{Triples}$. 2. Find an isolated local maximizer of $T(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$ via gradient ascent from a random $\vec{\eta} \in \mathrm{range}(\mathrm{Pairs})$. Say the maximum is $\lambda$ and the maximizer is $\vec{\eta}$. 3. Deflation: replace $T$ with $T - \lambda\, \vec{\eta} \otimes \vec{\eta} \otimes \vec{\eta}$. Go to step 2.
56 Extracting all isolated local maximizers 1. Start with $T := \mathrm{Triples}$. 2. Find an isolated local maximizer of $T(\vec{\eta}, \vec{\eta}, \vec{\eta})$ s.t. $\mathrm{Pairs}(\vec{\eta}, \vec{\eta}) \leq 1$ via gradient ascent from a random $\vec{\eta} \in \mathrm{range}(\mathrm{Pairs})$. Say the maximum is $\lambda$ and the maximizer is $\vec{\eta}$. 3. Deflation: replace $T$ with $T - \lambda\, \vec{\eta} \otimes \vec{\eta} \otimes \vec{\eta}$. Go to step 2. A variant of this runs in polynomial time (w.h.p.), and is robust to perturbations to $\mathrm{Pairs}$ and $\mathrm{Triples}$.
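A hedged sketch of this procedure in code. The slides only say "gradient ascent"; here I substitute the standard whitening + tensor power iteration to carry out each local maximization, and all names and the fixed iteration count are illustrative rather than the authors' implementation:

```python
import numpy as np

def decompose(pairs, triples, k, n_iter=200, seed=0):
    """Estimate (w_hat, mu_hat) from Pairs and Triples by whitening,
    repeated local maximization, and deflation. mu_hat has shape (k, d)."""
    rng = np.random.default_rng(seed)
    # Whitening W (d x k): W.T @ pairs @ W = I_k, so the Pairs constraint
    # becomes an ordinary Euclidean norm constraint in whitened coordinates.
    evals, evecs = np.linalg.eigh(pairs)
    top = np.argsort(evals)[::-1][:k]
    W = evecs[:, top] / np.sqrt(evals[top])
    T = np.einsum('ia,jb,kc,ijk->abc', W, W, W, triples)   # whitened k x k x k tensor
    B = np.linalg.pinv(W.T)                                # maps whitened vectors back to R^d
    w_hat, mu_hat = np.zeros(k), np.zeros((k, pairs.shape[0]))
    for i in range(k):
        theta = rng.normal(size=k)
        theta /= np.linalg.norm(theta)
        for _ in range(n_iter):                            # tensor power iteration
            theta = np.einsum('abc,b,c->a', T, theta, theta)
            theta /= np.linalg.norm(theta)
        lam = np.einsum('abc,a,b,c->', T, theta, theta, theta)   # equals 1 / sqrt(w_i)
        w_hat[i] = 1.0 / lam**2
        mu_hat[i] = lam * (B @ theta)
        T = T - lam * np.einsum('a,b,c->abc', theta, theta, theta)  # deflation
    return w_hat, mu_hat
```

A more robust implementation would restart the inner maximization from several random initial points and keep the best run before deflating.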
57 General case: asymmetric views Each view $v$ has a different set of conditional means $\{\vec{\mu}_{v,1}, \vec{\mu}_{v,2}, \dots, \vec{\mu}_{v,k}\} \subset \mathbb{R}^{d_v}$.
58 General case: asymmetric views Each view $v$ has a different set of conditional means $\{\vec{\mu}_{v,1}, \vec{\mu}_{v,2}, \dots, \vec{\mu}_{v,k}\} \subset \mathbb{R}^{d_v}$. Reduction: transform $\vec{x}_1$ and $\vec{x}_2$ to look like $\vec{x}_3$ via linear transformations.
59 Asymmetric cross moments Define the asymmetric cross moment: $\mathrm{Pairs}_{u,v} := \mathbb{E}[\vec{x}_u \otimes \vec{x}_v]$.
60 Asymmetric cross moments Define the asymmetric cross moment: $\mathrm{Pairs}_{u,v} := \mathbb{E}[\vec{x}_u \otimes \vec{x}_v]$. Transforming view $v$ to view 3: $C_{v \to 3} := \mathbb{E}[\vec{x}_3 \otimes \vec{x}_u]\, \mathbb{E}[\vec{x}_v \otimes \vec{x}_u]^{\dagger} \in \mathbb{R}^{d_3 \times d_v}$ (with $u$ the remaining view), where $\dagger$ denotes the Moore-Penrose pseudoinverse.
61 Asymmetric cross moments Define the asymmetric cross moment: $\mathrm{Pairs}_{u,v} := \mathbb{E}[\vec{x}_u \otimes \vec{x}_v]$. Transforming view $v$ to view 3: $C_{v \to 3} := \mathbb{E}[\vec{x}_3 \otimes \vec{x}_u]\, \mathbb{E}[\vec{x}_v \otimes \vec{x}_u]^{\dagger} \in \mathbb{R}^{d_3 \times d_v}$ (with $u$ the remaining view), where $\dagger$ denotes the Moore-Penrose pseudoinverse. Simple exercise to show $\mathbb{E}[C_{v \to 3}\, \vec{x}_v \mid h = j] = \vec{\mu}_{3,j}$, so $C_{v \to 3}\, \vec{x}_v$ behaves like $\vec{x}_3$ (as far as our algorithm can tell).
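A minimal sketch of estimating this transformation from samples, assuming each view is stored as an n-by-d_v array of observations (array names are placeholders):

```python
import numpy as np

def view_transform(X_u, X_v, X_3):
    """Estimate C_{v->3} = E[x3 (x) xu] (E[xv (x) xu])^+ so that C_{v->3} @ x_v
    has (approximately) the same conditional means as x_3."""
    n = X_u.shape[0]
    cross_3u = X_3.T @ X_u / n                  # d3 x du estimate of E[x3 (x) xu]
    cross_vu = X_v.T @ X_u / n                  # dv x du estimate of E[xv (x) xu]
    return cross_3u @ np.linalg.pinv(cross_vu)  # d3 x dv

# After transforming views 1 and 2, the exchangeable-case algorithm above can be
# applied to the triple (C_{1->3} x_1, C_{2->3} x_2, x_3).
```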
62 Part 3. Some applications and open questions Multi-view mixture models Multi-view method-of-moments Some applications and open questions Mixtures of Gaussians Hidden Markov models and other models Topic models Open questions Concluding remarks
63 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in $\mathbb{R}^n$, with component means $\vec{\mu}_1, \vec{\mu}_2, \dots, \vec{\mu}_k \in \mathbb{R}^n$; no minimum separation requirement. [Graphical model: $h \to x_1, x_2, \dots, x_n$.]
64 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in $\mathbb{R}^n$, with component means $\vec{\mu}_1, \vec{\mu}_2, \dots, \vec{\mu}_k \in \mathbb{R}^n$; no minimum separation requirement. [Graphical model: $h \to x_1, x_2, \dots, x_n$.] Assumptions: non-degeneracy: component means span a $k$-dimensional subspace; weak incoherence condition: component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, 08).
65 Mixtures of axis-aligned Gaussians Mixture of axis-aligned Gaussians in $\mathbb{R}^n$, with component means $\vec{\mu}_1, \vec{\mu}_2, \dots, \vec{\mu}_k \in \mathbb{R}^n$; no minimum separation requirement. [Graphical model: $h \to x_1, x_2, \dots, x_n$.] Assumptions: non-degeneracy: component means span a $k$-dimensional subspace; weak incoherence condition: component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, 08). Then, randomly partitioning the coordinates into $\ell \geq 3$ views guarantees (w.h.p.) that non-degeneracy holds in all $\ell$ views.
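A small sketch of the random-partition reduction described above (the number of views and all names are illustrative):

```python
import numpy as np

def random_views(X, n_views=3, seed=0):
    """Randomly partition the coordinates of X (an n_samples x n array) into
    n_views disjoint blocks, yielding a multi-view dataset for the algorithm above."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    blocks = np.array_split(perm, n_views)
    return [X[:, b] for b in blocks]
```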
66 Hidden Markov models and others [Figure: HMM with hidden chain $h_1 \to h_2 \to h_3$ and observations $x_1, x_2, x_3$.]
67 Hidden Markov models and others [Figure: HMM with hidden chain $h_1 \to h_2 \to h_3$ and observations $x_1, x_2, x_3$, alongside the corresponding three-view mixture model $h \to (x_1, x_2, x_3)$.]
68 Hidden Markov models and others [Figures as above.] Other models: 1. Mixtures of Gaussians (Hsu-Kakade, ITCS 13). 2. HMMs (Anandkumar-Hsu-Kakade, COLT 12). 3. Latent Dirichlet Allocation (Anandkumar-Foster-Hsu-Kakade-Liu, NIPS 12). 4. Latent parse trees (Hsu-Kakade-Liang, NIPS 12). 5. Independent Component Analysis (Arora-Ge-Moitra-Sachdeva, NIPS 12; Hsu-Kakade, ITCS 13).
69 Bag-of-words clustering model $(\vec{\mu}_j)_i = \Pr[\,\text{see word } i \text{ in document} \mid \text{document topic is } j\,]$. Corpus: New York Times (from UCI), articles. Vocabulary size: $d =$ words. Chose $k = 50$. For each topic $j$, show top 10 words $i$.
70 Bag-of-words clustering model $(\vec{\mu}_j)_i = \Pr[\,\text{see word } i \text{ in document} \mid \text{document topic is } j\,]$. Corpus: New York Times (from UCI), articles. Vocabulary size: $d =$ words. Chose $k = 50$. For each topic $j$, show top 10 words $i$. Topic: sales, economic, consumer, major, home, indicator, weekly, order, claim, scheduled. Topic: run, inning, hit, game, season, home, right, games, dodger, left. Topic: school, student, teacher, program, official, public, children, high, education, district. Topic: drug, patient, million, company, doctor, companies, percent, cost, program, health. Topic: player, tiger_wood, won, shot, play, round, win, tournament, tour, right.
71 Bag-of-words clustering model Topic: palestinian, israel, israeli, yasser_arafat, peace, israeli, israelis, leader, official, attack. Topic: tax, cut, percent, bush, billion, plan, bill, taxes, million, congress. Topic: cup, minutes, oil, water, add, tablespoon, food, teaspoon, pepper, sugar. Topic: point, game, team, shot, play, laker, season, half, lead, games. Topic: yard, game, play, season, team, touchdown, quarterback, coach, defense, quarter.
72 Bag-of-words clustering model Topic: percent, stock, market, fund, investor, companies, analyst, money, investment, economy. Topic: al_gore, campaign, president, george_bush, bush, clinton, vice, presidential, million, democratic. Topic: car, race, driver, team, won, win, racing, track, season, lap. Topic: book, children, ages, author, read, newspaper, web, writer, written, sales. Topic: taliban, attack, afghanistan, official, military, u_s, united_states, terrorist, war, bin.
73 Bag-of-words clustering model Topic: com, www, site, web, sites, information, online, mail, internet, telegram. Topic: court, case, law, lawyer, federal, government, decision, trial, microsoft, right. Topic: show, network, season, nbc, cb, program, television, series, night, new_york. Topic: film, movie, director, play, character, actor, show, movies, million, part. Topic: music, song, group, part, new_york, company, million, band, show, album. etc.
74 Some open questions What if $k > d_v$? (Relevant to overcomplete dictionary learning.)
75 Some open questions What if $k > d_v$? (Relevant to overcomplete dictionary learning.) Apply some non-linear transformations $\vec{x}_v \mapsto f_v(\vec{x}_v)$?
76 Some open questions What if $k > d_v$? (Relevant to overcomplete dictionary learning.) Apply some non-linear transformations $\vec{x}_v \mapsto f_v(\vec{x}_v)$? Combine views, e.g., via tensor product: $\vec{x}_{1,2} := \vec{x}_1 \otimes \vec{x}_2$, $\vec{x}_{3,4} := \vec{x}_3 \otimes \vec{x}_4$, $\vec{x}_{5,6} := \vec{x}_5 \otimes \vec{x}_6$, etc.?
77 Some open questions What if $k > d_v$? (Relevant to overcomplete dictionary learning.) Apply some non-linear transformations $\vec{x}_v \mapsto f_v(\vec{x}_v)$? Combine views, e.g., via tensor product: $\vec{x}_{1,2} := \vec{x}_1 \otimes \vec{x}_2$, $\vec{x}_{3,4} := \vec{x}_3 \otimes \vec{x}_4$, $\vec{x}_{5,6} := \vec{x}_5 \otimes \vec{x}_6$, etc.? Can we relax the multi-view assumption?
78 Some open questions What if $k > d_v$? (Relevant to overcomplete dictionary learning.) Apply some non-linear transformations $\vec{x}_v \mapsto f_v(\vec{x}_v)$? Combine views, e.g., via tensor product: $\vec{x}_{1,2} := \vec{x}_1 \otimes \vec{x}_2$, $\vec{x}_{3,4} := \vec{x}_3 \otimes \vec{x}_4$, $\vec{x}_{5,6} := \vec{x}_5 \otimes \vec{x}_6$, etc.? Can we relax the multi-view assumption? Allow for richer hidden state? (e.g., independent component analysis)
79 Some open questions What if $k > d_v$? (Relevant to overcomplete dictionary learning.) Apply some non-linear transformations $\vec{x}_v \mapsto f_v(\vec{x}_v)$? Combine views, e.g., via tensor product: $\vec{x}_{1,2} := \vec{x}_1 \otimes \vec{x}_2$, $\vec{x}_{3,4} := \vec{x}_3 \otimes \vec{x}_4$, $\vec{x}_{5,6} := \vec{x}_5 \otimes \vec{x}_6$, etc.? Can we relax the multi-view assumption? Allow for richer hidden state? (e.g., independent component analysis) Gaussianization via random projection?
80 Part 4. Concluding remarks Multi-view mixture models Multi-view method-of-moments Some applications and open questions Concluding remarks
81 Concluding remarks Take-home messages:
82 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning.
83 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning. Overcoming complexity barriers: Some provably hard estimation problems become easy after ruling out degenerate cases.
84 Concluding remarks Take-home messages: Power of multiple views: Can take advantage of diverse / non-redundant sources of information in unsupervised learning. Overcoming complexity barriers: Some provably hard estimation problems become easy after ruling out degenerate cases. Blessing of dimensionality for estimators based on method-of-moments.
85 Thanks! (Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham Kakade, Yi-Kai Liu, Matus Telgarsky)