Efficient Spectral Methods for Learning Mixture Models
1 Efficient Spectral Methods for Learning Mixture Models
Qingqing Huang, February 2016
Laboratory for Information & Decision Systems
Based on joint work with Munther Dahleh, Rong Ge, Sham Kakade, and Greg Valiant.
2 Learning
Data generation: the underlying rule $\theta$ produces the data $X(\theta)$.
Learning: infer about the underlying rule $\theta$ from the data (estimation, approximation, property testing, optimization of $f(\theta)$).
Challenge: exploit our prior on the structure of the underlying $\theta$ to design fast algorithms that use as little data $X$ as possible to achieve the target accuracy in learning $\theta$.
Computational complexity and sample complexity are measured against $\dim(\theta)$.
3 Learning Mixture Models
Hidden variable $H \in \{1, \dots, K\}$; observed variables $X = (X_1, X_2, \dots, X_M) \in \mathcal{X}$.
The marginal distribution of the observables is a superposition of simple distributions:
$$\Pr(X) = \sum_{k=1}^{K} \Pr(H = k)\, \Pr(X \mid H = k)$$
Model parameters: $\theta$ = (number of mixture components, mixing weights, conditional probabilities).
Given $N$ i.i.d. samples of the observable variables, estimate the model parameters $\hat{\theta}$ with $\|\hat{\theta} - \theta\| \le \epsilon$.
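The superposition formula can be made concrete for a discrete observable. The following is a minimal sketch, assuming a finite outcome space so that each conditional $\Pr(X \mid H = k)$ is one row of a matrix. The function names `mixture_marginal` and `sample_mixture` and the toy numbers are illustrative and do not come from the talk.

```python
import numpy as np


def mixture_marginal(weights, conditionals):
    """Return Pr(X) = sum_k Pr(H=k) * Pr(X | H=k) for a discrete observable.

    weights:      shape (K,), the mixing weights Pr(H=k)
    conditionals: shape (K, n_outcomes), row k is Pr(X | H=k)
    """
    weights = np.asarray(weights, dtype=float)
    conditionals = np.asarray(conditionals, dtype=float)
    return weights @ conditionals


def sample_mixture(weights, conditionals, n_samples, rng):
    """Draw i.i.d. samples of X by first drawing the hidden variable H."""
    conditionals = np.asarray(conditionals, dtype=float)
    n_components, n_outcomes = conditionals.shape
    hidden = rng.choice(n_components, size=n_samples, p=weights)
    # Inverse-CDF sampling from the conditional selected by each hidden value.
    cdf = np.cumsum(conditionals, axis=1)
    u = rng.random(n_samples)
    x = (u[:, None] > cdf[hidden]).sum(axis=1)
    return np.minimum(x, n_outcomes - 1)  # guard against round-off in the CDF


# Toy model (assumed values): two components, two outcomes.
weights = np.array([0.5, 0.5])
conditionals = np.array([[0.9, 0.1],
                         [0.2, 0.8]])
marginal = mixture_marginal(weights, conditionals)

rng = np.random.default_rng(0)
samples = sample_mixture(weights, conditionals, 200_000, rng)
empirical = np.bincount(samples, minlength=2) / samples.size
```

The learner only sees `samples`; the hidden draws are discarded, which is what makes recovering `weights` and `conditionals` a non-trivial inverse problem.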
4 Examples of Mixture Models
- Gaussian mixtures (GMMs): hidden variable is the cluster; observations are data points in space.
- Topic models (bag of words): hidden variable is the topic; observations are the words in each document.
5 Examples of Mixture Models
- Hidden Markov models (HMMs): hidden variable is the current state; observations are the past, current, and future observations.
- Super-resolution: hidden variable is the source; observations are complex sinusoids.
6 Learning Mixture Models
Hidden variable $H \in \{1, \dots, K\}$; observed variables $X = (X_1, X_2, \dots, X_M) \in \mathcal{X}$.
Given $N$ i.i.d. samples of the observable variables, estimate the model parameters $\hat{\theta}$.
What do we know about sample complexity and computational complexity?
- There exist exponential lower bounds for sample complexity (worst-case analysis).
- Maximum likelihood estimation is a non-convex optimization problem (EM is a heuristic).
7 Challenges in Learning Mixture Models
- There exist exponential lower bounds for sample complexity (worst-case analysis).
  Can we learn non-worst-case instances with provably efficient algorithms?
- Maximum likelihood estimation is a non-convex optimization problem (EM is a heuristic).
  Can we achieve statistical efficiency with tractable computation?
8 Contribution / Outline of the talk
- There exist exponential lower bounds for sample complexity (worst-case analysis).
  Can we learn non-worst-case instances with provably efficient algorithms?
  - Spectral algorithms for learning GMMs, HMMs, and super-resolution.
  - Go beyond worst cases with randomness in the analysis and the algorithm.
- Maximum likelihood estimation is a non-convex optimization problem (EM is a heuristic).
  Can we achieve statistical efficiency with tractable computation?
  - Denoising a low-rank probability matrix with linear sample complexity (minimax optimal).
9 Paradigm of Spectral Algorithms for Learning Mixture Models
1. A mixture of conditional distributions $\theta$ generates the samples $X(\theta)$ (here $N = \infty$).
2. Compute statistics $M$, e.g. moments or marginal probabilities; $M$ is a mixture of rank-one matrices/tensors.
3. Spectral decomposition $A \otimes B \otimes C = M$ separates the mixtures.
4. Easy inverse problems recover the parameters from the factors.
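10 Paradigm of Spectral Algorithms for Learning
With finitely many samples, the same pipeline runs on empirical quantities:
samples $X(\theta)$ with $N$ finite -> compute statistics $\hat{M}$ -> spectral decomposition $\hat{A} \otimes \hat{B} \otimes \hat{C} = \hat{M}$ (in place of $A \otimes B \otimes C = M$) -> easy inverse problems.
- PCA, spectral clustering, and subspace system identification fit into this paradigm.
- Problem-specific algorithm design (which statistics $M$ to use?) and analysis (is the spectral decomposition stable?).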
11 PART 1: Provably efficient spectral algorithms for learning mixture models
1.1 Learn GMMs: Go beyond worst cases by smoothed analysis
1.2 Learn HMMs: Go beyond worst cases by generic analysis
1.3 Super-resolution: Go beyond worst cases by randomized algorithm
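12 1.1 Learn GMMs: Setup
Hidden cluster in $[k]$; observed data points $x \in \mathbb{R}^n$.
A mixture of $k$ multivariate Gaussians; data points in $n$-dimensional space.
Model parameters: weights $w_i$, means $\mu^{(i)}$, covariance matrices $\Sigma^{(i)}$.
$$x \sim \mathcal{N}(\mu^{(i)}, \Sigma^{(i)}) \quad \text{with probability } w_i, \quad i \in [k]$$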
13 1.1 Learn GMMs: Prior Works
General case:
- Moment matching method [Moitra & Valiant] [Belkin & Sinha]: $\mathrm{Poly}(n, e^{O(k)^k})$.
With restrictive assumptions on model parameters:
- Mean vectors are well separated: pairwise clustering [Dasgupta] [Vempala & Wang]: $\mathrm{Poly}(n, k)$.
- Mean vectors of spherical Gaussians are linearly independent: moment tensor decomposition [Hsu & Kakade]: $\mathrm{Poly}(n, k)$.
14 1.1 Learn GMMs: Worst case lower bound
Can we learn every GMM instance to target accuracy in polynomial runtime and using polynomially many samples?
No. There is exponential dependence on $k$ in the worst cases [Moitra & Valiant].
Can we learn most GMM instances with a polynomial algorithm, without restrictive assumptions on model parameters?
Yes. Worst cases are not everywhere, and we can handle non-worst-cases.
15 1.1 Learn GMMs: Smoothed Analysis Framework
Escape from the worst cases:
- For an arbitrary instance $\theta$, nature perturbs the parameters with a small amount ($\rho$) of noise, giving $\tilde{\theta}$.
- Observe data generated by $\tilde{\theta}$; design and analyze the algorithm for $\tilde{\theta}$.
Hope: with high probability over nature's perturbation, any arbitrary instance escapes from the degenerate cases and becomes well conditioned.
- Bridges worst-case and average-case algorithm analysis [Spielman & Teng].
- A stronger notion than generic analysis.
Our goal: given samples from the perturbed GMM, learn the perturbed parameters with negligible failure probability over nature's perturbation.
16 1.1 Learn GMMs: Main Results
Our algorithm learns the GMM parameters up to accuracy $\epsilon$:
- With fully polynomial time and sample complexity, $\mathrm{Poly}(n, k, 1/\epsilon)$.
- Assumption: data in high enough dimension, $n = \Omega(k^2)$.
- Under smoothed analysis: works with negligible failure probability.
Learning Mixture of Gaussians in High dimensions, R. Ge, H, S. Kakade (STOC 2015)
17 1.1 Learn GMMs: Algorithmic Ideas
Method of moments: match the 4th- and 6th-order moments
$$M_4 = \mathbb{E}[x^{\otimes 4}], \qquad M_6 = \mathbb{E}[x^{\otimes 6}].$$
Key challenge: the moment tensors are not of low rank, but they have special structure. Define the low-rank tensors
$$X_4 = \sum_{i=1}^{k} \Sigma^{(i)} \otimes \Sigma^{(i)}, \qquad X_6 = \sum_{i=1}^{k} \Sigma^{(i)} \otimes \Sigma^{(i)} \otimes \Sigma^{(i)}.$$
Structured linear projection:
$$M_4 = F_4(X_4), \qquad M_6 = F_6(X_6).$$
- The moment tensors are structured linear projections of the desired low-rank tensors.
- A delicate algorithm inverts the structured linear projections.
18 1.1 Learn GMMs: Algorithmic Ideas
Method of moments: match the 4th- and 6th-order moments $M_4 = \mathbb{E}[x^{\otimes 4}]$, $M_6 = \mathbb{E}[x^{\otimes 6}]$.
Key challenge: the moment tensors are not of low rank, but they have special structure.
Why do high dimension $n$ and smoothed analysis help us to learn?
- We have many moment-matching constraints using only low-order moments: the number of free parameters (order $kn^2$) is smaller than the number of 6th-order moments (order $n^6$).
- The randomness in nature's perturbation makes matrices/tensors well conditioned. For a Gaussian matrix $X \in \mathbb{R}^{n \times m}$, with probability at least $1 - O(\epsilon^n)$,
$$\sigma_m(X) \ge \epsilon \sqrt{n}.$$
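The well-conditioning effect of a random perturbation is easy to observe numerically. The sketch below is only an illustration under assumed values (the sizes `n`, `m`, the noise level `rho`, and the all-ones starting matrix are not from the talk): it takes a rank-one, hence degenerate, matrix and compares its smallest singular value before and after adding Gaussian noise.

```python
import numpy as np


def smallest_singular_value(X):
    """Return the smallest singular value of X."""
    return np.linalg.svd(X, compute_uv=False)[-1]


rng = np.random.default_rng(0)
n, m, rho = 200, 10, 0.1   # assumed sizes and perturbation level

X = np.ones((n, m))                               # rank one: a degenerate instance
X_tilde = X + rho * rng.standard_normal((n, m))   # nature's perturbation

sigma_before = smallest_singular_value(X)
sigma_after = smallest_singular_value(X_tilde)
print(f"before: {sigma_before:.2e}, after: {sigma_after:.2e}")
```

Before the perturbation the smallest singular value is zero up to round-off; afterwards it is on the order of `rho * sqrt(n)`, which is the kind of bound the smoothed analysis relies on.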
19 PART 1: Provably efficient spectral algorithms for learning mixture models
1.1 Learn GMMs: Go beyond worst cases by smoothed analysis
1.2 Learn HMMs: Go beyond worst cases by generic analysis
1.3 Super-resolution: Go beyond worst cases by randomized algorithm
20 1.2 Learn HMMs: Setup
Hidden states $h_{t-n}, \dots, h_{t-1}, h_t, h_{t+1}, \dots, h_{t+n} \in [k]$.
Observations $x_{t-n}, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_{t+n} \in [d]$.
Window size: $N = 2n + 1$.
Transition probability matrix: $Q \in \mathbb{R}^{k \times k}$. Observation probabilities: $O \in \mathbb{R}^{d \times k}$.
Given length-$N$ sequences of observations, how do we recover $\theta = (Q, O)$?
Our focus: how large does the window size $N$ need to be?
21 1.2 Learn HMMs: Hardness Results and Our Result
Hidden state in $[k]$; observation in $[d]$; window size $N = 2n + 1$.
Hardness results:
- HMMs are not efficiently PAC learnable under the noisy parity assumption [Abe, Warmuth] [Kearns].
- Construct an instance by reduction to parity with noise: the required window size is $N = \Omega(k)$ and the algorithm complexity is $\Omega(d^k)$.
Our result:
- Excluding a measure-zero set in the parameter space of $\theta = (Q, O)$, for almost all HMM instances the required window size $N$ is on the order of $\log_d k$.
- The spectral algorithm achieves sample complexity and runtime both $\mathrm{poly}(d, k)$.
Minimal Realization Problems for Hidden Markov Models, H, R. Ge, S. Kakade, M. Dahleh (IEEE Transactions on Signal Processing, 2016)
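22 1.2 Learn HMMs: Algorithmic idea
$$M = \Pr\big((x_{-n}, \dots, x_{-1}),\, x_0,\, (x_1, \dots, x_n)\big) \in \mathbb{R}^{d^n \times d^n \times d}$$
1. $M$ is a low-rank tensor of rank $k$: $M = A \otimes B \otimes C$, with
$$A = \Pr(x_1, x_2, \dots, x_n \mid h_0), \quad B = \Pr(x_{-1}, x_{-2}, \dots, x_{-n} \mid h_0), \quad C = \Pr(x_0, h_0).$$
2. Extract $Q$, $O$ from the tensor factors $A$, $B$:
$$A = \underbrace{(O \odot (O \odot (O \odot \cdots (O \odot OQ) \cdots )Q)Q)Q}_{n}, \qquad B = \underbrace{(O \odot (O \odot (O \odot \cdots (O \odot O\tilde{Q}) \cdots )\tilde{Q})\tilde{Q})\tilde{Q}}_{n}, \qquad C = O\,\mathrm{diag}(\pi).$$
Key challenge: how large does the window size need to be so that the tensor decomposition is unique?
Our careful generic analysis: if $N$ is on the order of $\log_d k$, the worst cases all lie in a measure-zero set.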
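The nested structure of the factor $A = \Pr(x_1, \dots, x_n \mid h_0)$ can be checked with a short recursion: start from $OQ$ and repeatedly take a column-wise Kronecker (Khatri-Rao) product with $O$, then multiply by $Q$. This is a sketch under stated assumptions: $O[x, h] = \Pr(x_t = x \mid h_t = h)$, and $Q$ is taken to be column-stochastic, $Q[i, j] = \Pr(h_{t+1} = i \mid h_t = j)$, a convention the slide does not spell out. The helper names are hypothetical.

```python
import numpy as np


def khatri_rao(X, Y):
    """Column-wise Kronecker product: column r is kron(X[:, r], Y[:, r])."""
    return np.einsum('ir,jr->ijr', X, Y).reshape(X.shape[0] * Y.shape[0], X.shape[1])


def future_factor(O, Q, n):
    """Return A with A[(x_1, ..., x_n), h_0] = Pr(x_1, ..., x_n | h_0).

    Assumes O[x, h] = Pr(x_t = x | h_t = h) and the column-stochastic
    convention Q[i, j] = Pr(h_{t+1} = i | h_t = j).
    """
    A = O @ Q                       # n = 1: Pr(x_1 | h_0)
    for _ in range(n - 1):
        A = khatri_rao(O, A) @ Q    # prepend one more observation
    return A


# Small random HMM (assumed sizes): d = 3 observations, k = 2 hidden states.
rng = np.random.default_rng(0)
d, k = 3, 2
O = rng.dirichlet(np.ones(d), size=k).T   # shape (d, k), columns sum to 1
Q = rng.dirichlet(np.ones(k), size=k).T   # shape (k, k), columns sum to 1
A = future_factor(O, Q, n=3)              # shape (d**3, k)
```

Each column of `A` is a probability distribution over the $d^n$ possible future observation sequences, so its entries sum to one.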
23 PART 1: Provably efficient spectral algorithms for learning mixture models
1.1 Learn GMMs: Go beyond worst cases by smoothed analysis
1.2 Learn HMMs: Go beyond worst cases by generic analysis
1.3 Super-resolution: Go beyond worst cases by randomized algorithm
24 1.3 Super-Resolution: Setup
[Figure: band-limited convolution of the point sources $X(\theta)$.]
Take Fourier measurements of the band-limited observation.
How do we recover the point sources from a coarse measurement of the signal?
- Small number of Fourier measurements.
- Low cutoff frequency.
25 1.3 Super-Resolution: Problem Formulation
- Recover point sources (a mixture of $k$ vectors in $d$-dimensional space):
$$x(t) = \sum_{j=1}^{k} w_j\, \delta_{\mu^{(j)}}.$$
  Assume minimum separation $\Delta = \min_{j \ne j'} \|\mu^{(j)} - \mu^{(j')}\|_2$.
- Use band-limited and noisy Fourier measurements:
$$\tilde{f}(s) = \sum_{j=1}^{k} w_j\, e^{i\pi\langle \mu^{(j)}, s\rangle} + z(s), \qquad \|s\|_\infty \le \text{cutoff freq},$$
  with bounded noise $|z(s)| \le \epsilon_z$ for all $s$.
- Achieve target accuracy $\|\hat{\mu}^{(j)} - \mu^{(j)}\|_2 \le \epsilon$ for all $j \in [k]$.
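The measurement model can be simulated directly. The sketch below evaluates $\tilde{f}(s)$ at a batch of frequencies, under two assumptions that are not from the talk: the exponent is taken as $i\pi\langle\mu^{(j)}, s\rangle$, and the bounded noise is drawn uniformly from a complex disc of radius `eps_z`. The function name and the sizes are hypothetical.

```python
import numpy as np


def fourier_measurements(mu, w, S, eps_z=0.0, rng=None):
    """Return f(s) = sum_j w_j * exp(i*pi*<mu_j, s>) + z(s) for each row s of S.

    mu: (k, d) point-source locations, w: (k,) weights, S: (m, d) frequencies.
    The noise z(s) is drawn uniformly from the complex disc of radius eps_z,
    so |z(s)| <= eps_z holds for every s (an assumed noise model).
    """
    V = np.exp(1j * np.pi * (S @ mu.T))      # (m, k) matrix of complex exponentials
    clean = V @ w
    if eps_z == 0.0:
        return clean
    rng = np.random.default_rng() if rng is None else rng
    radius = eps_z * np.sqrt(rng.random(len(S)))
    phase = 2.0 * np.pi * rng.random(len(S))
    return clean + radius * np.exp(1j * phase)


rng = np.random.default_rng(0)
k, d, m = 3, 2, 20                           # assumed sizes
mu = rng.uniform(-1.0, 1.0, size=(k, d))     # point sources
w = rng.uniform(0.5, 1.0, size=k)            # weights
S = rng.standard_normal((m, d))              # random measurement frequencies

f_clean = fourier_measurements(mu, w, S)
f_noisy = fourier_measurements(mu, w, S, eps_z=0.01, rng=rng)
```

The matrix `V` built inside the function has one row per frequency and one column per source; choosing the rows of `S` at random rather than on a fixed grid is the design choice the talk emphasizes.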
26 1.3 Super-Resolution: Prior Works
$$\tilde{f}(s) = \sum_{j=1}^{k} w_j\, e^{i\pi\langle \mu^{(j)}, s\rangle} + z(s), \qquad \Delta = \min_{j \ne j'} \|\mu^{(j)} - \mu^{(j')}\|_2$$
1-dimensional $\mu^{(j)}$:
- Take uniform measurements on the grid $s \in \{-N, \dots, -1, 0, 1, \dots, N\}$.
- SDP algorithm with cutoff frequency $N = \Omega(1/\Delta)$ [Candes, Fernandez-Granda].
- Hardness result: $N > C/\Delta$ [Moitra].
- One can use $k \log(k)$ random measurements out of the $2N$ measurements to recover [Tang, Bhaskar, Shah, Recht].
$d$-dimensional $\mu^{(j)}$:
- Multi-dimensional grid $s \in \{-N, \dots, -1, 0, 1, \dots, N\}^d$.
- Algorithm complexity $O\big(\mathrm{poly}(k, 1/\Delta)^d\big)$.
27 1.3 Super-Resolution: Main Result
Our algorithm achieves stable recovery:
- using $O((k + d)^2)$ Fourier measurements;
- with the cutoff frequency of the measurements bounded by $O(1/\Delta)$;
- with algorithm runtime $O((k + d)^3)$;
- with negligible failure probability.

Method | Cutoff freq | Measurements | Runtime
SDP | $C_d / \Delta_\infty$ | $(1/\Delta_\infty)^d$ | $\mathrm{poly}((1/\Delta_\infty)^d, k)$
MP | - | - | -
Ours | $\log(kd) / \Delta$ | $(k \log(k) + d)^2$ | $(k \log(k) + d)^2$

Super-Resolution off the Grid, H, S. Kakade (NIPS 2015)
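28 1.3 Super-Resolution: Algorithmic Idea
$$\tilde{f}(s) = \sum_{j=1}^{k} w_j\, e^{i\pi\langle \mu^{(j)}, s\rangle} + z(s)$$
- Tensor decomposition with measurements on random frequencies.
- Random samples $s$ are chosen such that the measurement tensor $F$ admits a particular low-rank decomposition (a rank-$k$ 3-way tensor):
$$F = V_{S'} \otimes V_{S'} \otimes (V_2 D_w),$$
  where, for frequencies $S = \{s^{(1)}, \dots, s^{(m)}\}$, $V_S$ is the Vandermonde matrix with complex nodes:
$$V_S = \begin{bmatrix} e^{i\pi\langle \mu^{(1)}, s^{(1)}\rangle} & \cdots & e^{i\pi\langle \mu^{(k)}, s^{(1)}\rangle} \\ e^{i\pi\langle \mu^{(1)}, s^{(2)}\rangle} & \cdots & e^{i\pi\langle \mu^{(k)}, s^{(2)}\rangle} \\ \vdots & \ddots & \vdots \\ e^{i\pi\langle \mu^{(1)}, s^{(m)}\rangle} & \cdots & e^{i\pi\langle \mu^{(k)}, s^{(m)}\rangle} \end{bmatrix}$$
- Skip the intermediate step of recovering order-$N^d$ measurements on the grid.
- Prony's method (Matrix-Pencil / MUSIC / ...) is just choosing measurements on the hyper-grid $s$ deterministically.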
29 1.3 Super-Resolution: Algorithmic Idea
- Tensor decomposition with measurements on random frequencies.
- Why do we not contradict the hardness result, with $O((k + d)^2)$ measurements versus $O\big(\mathrm{poly}(k, 1/\Delta)^d\big)$?
  - If we design a fixed grid of $s$ on which to take the measurements $f(s)$, there always exist model instances for which that particular grid fails.
  - Let the locations of $s$ be random (with structure for the tensor decomposition); then for any model instance, the algorithm works with high probability.
30 PART 2: Can we achieve optimal sample complexity in a tractable way?
Estimate low-rank probability matrices with linear sample complexity.
- This problem is at the core of many spectral algorithms.
- We capitalize on insights from community detection to solve it.
31 2. Low rank matrix: Setup
- Probability matrix $B \in \mathbb{R}_+^{M \times M}$ (a distribution over $M^2$ outcomes).
- $B$ is of rank 2: $B = pp^\top + qq^\top$.
- $N$ i.i.d. samples give the count matrix $B_N = \mathrm{Poisson}(NB)$ (frequency counts over the $M^2$ outcomes).
Goal: find a rank-2 $\hat{B}$ such that $\|\hat{B} - B\|_1 \le \epsilon$.
Sample complexity $N$: upper bound (algorithm) and lower bound.
32 2. Low rank matrix: Connection to mixture models
Topic model:
- $\Pr(\text{word}_1, \text{word}_2 \mid \text{topic} = T_1) = pp^\top$
- $\Pr(\text{word}_1, \text{word}_2 \mid \text{topic} = T_2) = qq^\top$
- $B$ is the joint distribution over word pairs.
HMM:
- $\Pr(\text{output}_1, \text{output}_2 \mid \text{state} = S_i) = O_i (OQ_i)^\top$
- $B$ is the distribution of consecutive outputs.
Pipeline: $N$ data samples -> empirical counts $B_N$ -> find a low-rank $\hat{B}$ close to $B$ -> extract parameter estimates.
33 2. Low rank matrix: Sub-Optimal Attempt
- Probability matrix $B \in \mathbb{R}_+^{M \times M}$ of rank 2: $B = pp^\top + qq^\top$.
- $N$ i.i.d. samples: count matrix $B_N = \mathrm{Poisson}(NB)$.
MLE is a non-convex optimization problem, so let's try something spectral.
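34 2. Low rank matrix: Sub-Optimal Attempt
- Probability matrix $B \in \mathbb{R}_+^{M \times M}$ of rank 2: $B = pp^\top + qq^\top$.
- $N$ i.i.d. samples: count matrix $B_N = \mathrm{Poisson}(NB)$.
$$\tfrac{1}{N} B_N = \tfrac{1}{N}\,\mathrm{Poisson}(NB) \to B \quad \text{as } N \to \infty$$
Set $\hat{B}$ to be the rank-2 truncated SVD of $\tfrac{1}{N} B_N$.
To achieve accuracy $\|\hat{B} - B\|_1 \le \epsilon$, this needs $N = \Omega(M^2 \log M)$.
- Not sample efficient: hopefully $N$ can be linear in $M$.
- Small data in practice: the word distribution in language has a fat tail, and the more sample documents $N$, the larger the vocabulary size $M$.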
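The plain truncated-SVD estimator is short enough to write out. The following is a minimal sketch on synthetic data, not the improved algorithm of the talk; the sizes `M`, `N`, the Dirichlet-distributed word distributions, and the equal mixing weights are all assumptions made for illustration.

```python
import numpy as np


def rank2_truncated_svd(counts, n_samples):
    """Rank-2 truncated SVD of the empirical frequency matrix counts / n_samples."""
    U, s, Vt = np.linalg.svd(np.asarray(counts, dtype=float) / n_samples)
    return (U[:, :2] * s[:2]) @ Vt[:2]


rng = np.random.default_rng(0)
M, N = 200, 2_000_000   # assumed vocabulary size and sample size

# Hypothetical rank-2 probability matrix: two word distributions, equal weights.
# The weights are absorbed into p and q so that B = p p^T + q q^T sums to 1.
p = np.sqrt(0.5) * rng.dirichlet(np.ones(M))
q = np.sqrt(0.5) * rng.dirichlet(np.ones(M))
B = np.outer(p, p) + np.outer(q, q)

counts = rng.poisson(N * B)                 # Poisson(N B) sample counts
B_hat = rank2_truncated_svd(counts, N)

err_raw = np.abs(counts / N - B).sum()      # l1 error of the raw frequencies
err_svd = np.abs(B_hat - B).sum()           # l1 error after rank-2 truncation
print(f"raw l1 error: {err_raw:.4f}, rank-2 SVD l1 error: {err_svd:.4f}")
```

In this well-sampled regime the truncation already removes much of the sampling noise; the difficulty the talk addresses is the regime where $N$ is only linear in $M$ and most entries of the count matrix are zero.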
35 2. Low rank matrix: Main Result
Our upper bound algorithm:
- Rank-2 estimate $\hat{B}$ with accuracy $\|\hat{B} - B\|_1 \le \epsilon$, for all $\epsilon > 0$.
- Uses $N = O(M/\epsilon^2)$ sample counts.
- Runtime $O(M^3)$.
- Leads to improved spectral algorithms for learning.
We prove a (strong) lower bound:
- A sequence of $\Omega(M)$ observations is needed to test whether the sequence is i.i.d. uniform over $M$ outcomes or generated by a 2-state HMM.
- Is testing a property no easier than estimating?
Recovering Structured Probability Matrices, H, S. Kakade, W. Kong, G. Valiant (submitted to STOC 2016)
36 2. Low rank matrix: Algorithmic Idea
We capitalize on the idea of community detection in sparse random networks, which is a special case of our problem formulation.
- $M$ nodes, 2 communities.
- Expected connection: $B = pp^\top + qq^\top$.
- Adjacency matrix: $B_N = \mathrm{Bernoulli}(NB)$.
The model generates $B_N$ from $B$; the task is to estimate $B$ (and $p$, $q$) from $B_N$.
37 2. Low rank matrix: Algorithmic Idea
We capitalize on the idea of community detection in sparse random networks, which is a special case of our problem formulation.
- $M$ nodes, 2 communities.
- Expected connection: $B = pp^\top + qq^\top$.
- Adjacency matrix: $B_N = \mathrm{Bernoulli}(NB)$.
Regularized truncated SVD [Le, Levina, Vershynin]: remove the heavy rows/columns from $B_N$, then run a rank-2 SVD on the remaining graph.
38 2. Low rank matrix: Algorithmic Idea
We capitalize on the idea of community detection in sparse random networks, which is a special case of our problem formulation.
- $M$ nodes, 2 communities.
- Expected connection: $B = pp^\top + qq^\top$.
- Adjacency matrix: $B_N = \mathrm{Bernoulli}(NB)$.
Regularized truncated SVD [Le, Levina, Vershynin]: remove the heavy rows/columns from $B_N$, then run a rank-2 SVD on the remaining graph.
Correspondence with our setup: an $M \times M$ matrix; probability matrix $B = pp^\top + qq^\top$; sample counts $B_N = \mathrm{Poisson}(NB)$.
Key challenge: in our general setup, we do not have homogeneous marginal probabilities.
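The trimming step can be sketched as follows. This is only an illustration of the idea of removing heavy rows and columns before the SVD; the precise regularization analyzed by [Le, Levina, Vershynin] may differ, and the `heavy_factor` threshold and function name here are assumptions.

```python
import numpy as np


def regularized_truncated_svd(counts, rank=2, heavy_factor=2.0):
    """Zero out heavy rows/columns, then take a rank-`rank` truncated SVD.

    An index is called heavy when its total count (row sum plus column sum)
    exceeds heavy_factor times the average total; this threshold is an
    illustrative assumption.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0) + counts.sum(axis=1)
    heavy = totals > heavy_factor * totals.mean()
    trimmed = counts.copy()
    trimmed[heavy, :] = 0.0
    trimmed[:, heavy] = 0.0
    U, s, Vt = np.linalg.svd(trimmed)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank], heavy


# Toy count matrix with one planted heavy row.
counts = np.ones((6, 6))
counts[0, :] = 100.0
estimate, heavy = regularized_truncated_svd(counts)
```

Without trimming, the single heavy row would dominate the leading singular vectors; after trimming, the SVD sees only the homogeneous part of the matrix.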
39 2. Low rank matrix: Algorithmic Idea 1, Binning
We group words according to their empirical marginal probability, divide the matrix into blocks, then apply regularized truncated SVD to each block.
Phase 1:
1. Estimate the non-uniform marginal $\hat{\rho}$.
2. Bin the $M$ words according to $\hat{\rho}_i$.
3. Run regularized truncated SVD on each bin × bin block of the count matrix to form the estimate for that block.
Key challenges:
- Binning is not exact, so we need to deal with spillover.
- We need to piece together the estimates over bins.
40 2. Low rank matrix: Algorithmic Idea 2, Refinement
The coarse estimate from Phase 1 gives some global information; we use it to do local refinement for each row / column.
Phase 2:
1. Phase 1 gives a coarse estimate for many words.
2. Refine the estimate for each word using linear regression.
3. Achieve sample complexity $N = O(M/\epsilon^2)$.
41 Conclusion
- Spectral methods are powerful tools for learning mixture models.
- We can go beyond worst-case analysis by exploiting randomness in the analysis or the algorithm.
- To make spectral algorithms more practical, one needs careful algorithm implementation to improve sample complexity.
42 Future works
- Robustness: agnostic learning, generalization error analysis.
- Dynamics: extend the analysis techniques and algorithmic ideas to the learning of random processes, with streaming data and iterative algorithms.
- Get hands dirty with real
43 References
- Learning Mixture of Gaussians in High dimensions, R. Ge, H, S. Kakade (STOC 2015)
- Super-Resolution off the Grid, H, S. Kakade (NIPS 2015)
- Minimal Realization Problems for Hidden Markov Models, H, R. Ge, S. Kakade, M. Dahleh (IEEE Transactions on Signal Processing, 2016)
- Recovering Structured Probability Matrices, H, S. Kakade, W. Kong, G. Valiant (submitted to STOC 2016)
44 Tensor Decomposition
A tensor is a multi-way array (as in MATLAB).
- 2-way tensor = matrix.
- 3-way tensor: $M_{j_1, j_2, j_3}$, with $j_1 \in [n_A]$, $j_2 \in [n_B]$, $j_3 \in [n_C]$.
45 Tensor Decomposition
Sum of rank-one tensors:
$$[a \otimes b \otimes c]_{j_1, j_2, j_3} = a_{j_1} b_{j_2} c_{j_3}$$
$$M = \sum_{i=1}^{k} A_{[:,i]} \otimes B_{[:,i]} \otimes C_{[:,i]} = A \otimes B \otimes C$$
Tensor rank: the minimum number of summands in a rank-one decomposition.
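The sum-of-rank-one construction maps directly onto an Einstein summation. A small sketch, with hypothetical sizes and a hypothetical helper name `tensor_from_factors`:

```python
import numpy as np


def tensor_from_factors(A, B, C):
    """Return M[j1, j2, j3] = sum_i A[j1, i] * B[j2, i] * C[j3, i]."""
    return np.einsum('ai,bi,ci->abc', A, B, C)


rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))   # n_A = 4, k = 2
B = rng.standard_normal((3, 2))   # n_B = 3
C = rng.standard_normal((5, 2))   # n_C = 5
M = tensor_from_factors(A, B, C)  # shape (4, 3, 5), rank at most 2
```

Passing single-column factors gives a rank-one tensor whose entries are the products $a_{j_1} b_{j_2} c_{j_3}$.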
46 Tensor Decomposition
$$M = \sum_{i=1}^{k} A_{[:,i]} \otimes B_{[:,i]} \otimes C_{[:,i]} = A \otimes B \otimes C$$
Sufficient condition for unique tensor decomposition:
If $A$ and $B$ are of full rank $k$, and $C$ has rank at least 2, we can decompose $M$ to uniquely find the factors $A$, $B$, $C$ in polynomial time, and the stability depends polynomially on the condition numbers of $A$, $B$, $C$ (the algorithm boils down to matrix SVD).
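One standard way to realize such a decomposition is simultaneous diagonalization, often attributed to Jennrich. The talk only says the algorithm boils down to matrix SVD, so the eigendecomposition-based variant below is a swapped-in illustration, not necessarily the speaker's procedure. It assumes exact (noise-free) data, that $A$ and $B$ have full column rank $k$, and that no two columns of $C$ are parallel; the factors come back only up to column permutation and scaling.

```python
import numpy as np


def jennrich(M, k, rng):
    """Recover factors of M = sum_i a_i (x) b_i (x) c_i by simultaneous diagonalization.

    Assumes A and B have full column rank k and that no two columns of C are
    parallel. Factors are returned up to column permutation and scaling.
    """
    n_a, n_b, n_c = M.shape
    x, y = rng.standard_normal(n_c), rng.standard_normal(n_c)
    # Random contractions along the third mode: Mx = A diag(C^T x) B^T, same for y.
    Mx = np.einsum('abc,c->ab', M, x)
    My = np.einsum('abc,c->ab', M, y)
    # Mx pinv(My) = A D pinv(A) and Mx^T pinv(My^T) = B D pinv(B) share the
    # diagonal D, so their eigenvectors give the columns of A and B.
    vals_a, vecs_a = np.linalg.eig(Mx @ np.linalg.pinv(My, rcond=1e-10))
    vals_b, vecs_b = np.linalg.eig(Mx.T @ np.linalg.pinv(My.T, rcond=1e-10))

    def top_k_sorted(vals):
        idx = np.argsort(-np.abs(vals))[:k]        # drop the (near-)zero eigenvalues
        return idx[np.argsort(vals[idx].real)]     # order by eigenvalue to pair A with B

    A = vecs_a[:, top_k_sorted(vals_a)].real
    B = vecs_b[:, top_k_sorted(vals_b)].real
    # With A and B fixed, C solves a linear least-squares problem.
    khatri_rao = np.einsum('ai,bi->abi', A, B).reshape(n_a * n_b, k)
    C = np.linalg.lstsq(khatri_rao, M.reshape(n_a * n_b, n_c), rcond=None)[0].T
    return A, B, C


rng = np.random.default_rng(0)
A_true = rng.standard_normal((6, 3))
B_true = rng.standard_normal((5, 3))
C_true = rng.standard_normal((4, 3))
M = np.einsum('ai,bi,ci->abc', A_true, B_true, C_true)

A_hat, B_hat, C_hat = jennrich(M, k=3, rng=rng)
M_hat = np.einsum('ai,bi,ci->abc', A_hat, B_hat, C_hat)
```

The eigenvalue gaps of the contracted matrices govern how stable the recovered eigenvectors are, which is one way to see why the condition numbers of the factors enter the stability guarantee.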
More informationSuper-resolution via Convex Programming
Super-resolution via Convex Programming Carlos Fernandez-Granda (Joint work with Emmanuel Candès) Structure and Randomness in System Identication and Learning, IPAM 1/17/2013 1/17/2013 1 / 44 Index 1 Motivation
More informationTime Series 2. Robert Almgren. Sept. 21, 2009
Time Series 2 Robert Almgren Sept. 21, 2009 This week we will talk about linear time series models: AR, MA, ARMA, ARIMA, etc. First we will talk about theory and after we will talk about fitting the models
More informationA Practical Algorithm for Topic Modeling with Provable Guarantees
1 A Practical Algorithm for Topic Modeling with Provable Guarantees Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu Reviewed by Zhao Song December
More informationAnalysis of Robust PCA via Local Incoherence
Analysis of Robust PCA via Local Incoherence Huishuai Zhang Department of EECS Syracuse University Syracuse, NY 3244 hzhan23@syr.edu Yi Zhou Department of EECS Syracuse University Syracuse, NY 3244 yzhou35@syr.edu
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationNote Set 5: Hidden Markov Models
Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional
More information26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.
10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with
More informationSparse Gaussian Markov Random Field Mixtures for Anomaly Detection
Sparse Gaussian Markov Random Field Mixtures for Anomaly Detection Tsuyoshi Idé ( Ide-san ), Ankush Khandelwal*, Jayant Kalagnanam IBM Research, T. J. Watson Research Center (*Currently with University
More informationOn Learning Mixtures of Well-Separated Gaussians
On Learning Mixtures of Well-Separated Gaussians Oded Regev Aravindan Viayaraghavan Abstract We consider the problem of efficiently learning mixtures of a large number of spherical Gaussians, when the
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationFast and Robust Phase Retrieval
Fast and Robust Phase Retrieval Aditya Viswanathan aditya@math.msu.edu CCAM Lunch Seminar Purdue University April 18 2014 0 / 27 Joint work with Yang Wang Mark Iwen Research supported in part by National
More informationDisentangling Gaussians
Disentangling Gaussians Ankur Moitra, MIT November 6th, 2014 Dean s Breakfast Algorithmic Aspects of Machine Learning 2015 by Ankur Moitra. Note: These are unpolished, incomplete course notes. Developed
More informationGraph Partitioning Using Random Walks
Graph Partitioning Using Random Walks A Convex Optimization Perspective Lorenzo Orecchia Computer Science Why Spectral Algorithms for Graph Problems in practice? Simple to implement Can exploit very efficient
More informationLearning symmetric non-monotone submodular functions
Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru
More informationConvex relaxation for Combinatorial Penalties
Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationDISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania
Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationSparse Parameter Estimation: Compressed Sensing meets Matrix Pencil
Sparse Parameter Estimation: Compressed Sensing meets Matrix Pencil Yuejie Chi Departments of ECE and BMI The Ohio State University Colorado School of Mines December 9, 24 Page Acknowledgement Joint work
More informationRecoverabilty Conditions for Rankings Under Partial Information
Recoverabilty Conditions for Rankings Under Partial Information Srikanth Jagabathula Devavrat Shah Abstract We consider the problem of exact recovery of a function, defined on the space of permutations
More information10-701/15-781, Machine Learning: Homework 4
10-701/15-781, Machine Learning: Homewor 4 Aarti Singh Carnegie Mellon University ˆ The assignment is due at 10:30 am beginning of class on Mon, Nov 15, 2010. ˆ Separate you answers into five parts, one
More informationMinimal Realization Problems for Hidden Markov Models
Minimal Realization Problems for Hidden Markov Models Qingqing Huang, Rong Ge, Sham Kakade, Munther Dahleh 1 arxiv:14113698v2 [cslg] 14 Dec 2015 Abstract This paper addresses two fundamental problems in
More informationFirst Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate
58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More information6.842 Randomness and Computation Lecture 5
6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its
More informationHigh-dimensional graphical model selection: Practical and information-theoretic limits
1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John
More informationSum-of-Squares Method, Tensor Decomposition, Dictionary Learning
Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning David Steurer Cornell Approximation Algorithms and Hardness, Banff, August 2014 for many problems (e.g., all UG-hard ones): better guarantees
More informationHigh dimensional Ising model selection
High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationRobustness Meets Algorithms
Robustness Meets Algorithms Ankur Moitra (MIT) ICML 2017 Tutorial, August 6 th CLASSIC PARAMETER ESTIMATION Given samples from an unknown distribution in some class e.g. a 1-D Gaussian can we accurately
More informationComputational Complexity of Bayesian Networks
Computational Complexity of Bayesian Networks UAI, 2015 Complexity theory Many computations on Bayesian networks are NP-hard Meaning (no more, no less) that we cannot hope for poly time algorithms that
More informationhe Applications of Tensor Factorization in Inference, Clustering, Graph Theory, Coding and Visual Representation
he Applications of Tensor Factorization in Inference, Clustering, Graph Theory, Coding and Visual Representation Amnon Shashua School of Computer Science & Eng. The Hebrew University Matrix Factorization
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationApplications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices
Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationFinal Exam, Fall 2002
15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work
More information