Learning with Matrix Factorizations
|
|
- Katrina Robertson
- 5 years ago
- Views:
Transcription
1 Learning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Adviser: Tommi Jaakkola (MIT) Committee: Alan Willsky (MIT), Tali Tishby (HUJI), Josh Tenenbaum (MIT) Joint work with Shie Mannor (McGill), Noga Alon (TAU) and Jason Rennie (MIT) Lots of help from John Barnett, Karen Livescu, Adi Shraibman, Michael Collins, Erik Demaine, Michel Goemans, David Karger and Dmitry Panchenko
2 Dimensionality Reduction: Low dimensional representation capturing important aspects of high dimensional data observations y image gene expression levels a user s preferences document small representation u varying aspects of scene description of biological process characterization of user topics Compression (mostly to reduce processing time) Reconstructing latent signal biological processes through gene expression Capturing structure in a corpus documents, images, etc Prediction: collaborative filtering
3 Linear Dimensionality Reduction y u1 v 1 + u2 v 2 + u3 v 3
4 Linear Dimensionality Reduction titles titles y u1 +1 v 1 comic value + u2 +2 v 2 dramatic value + u3-1 v 3 violence preferences of a specific user (real-valued preference level for each title) characteristics of the user
5 Linear Dimensionality Reduction titles y u titles V
6 Linear Dimensionality Reduction users titles y Y data u U titles V
7 Matrix Factorization Y data U V = X rank k Non-Negativity [LeeSeung99] Stochasticity (convexity) [LeeSeung97][Barnett04] Sparsity Clustering as an extreme (when rows of U sparse) Unconstrained: Low Rank Approximation
8 Matrix Factorization Y data U V = + Z i.i.d. Gaussian Additive Gaussian noise minimize Y-UV Fro ( ) 1 = = 2 '; Y log P( Y UV ' ) ( Y UV ' ) + log L UV 2 ij ij ij 2σ ij ij ij const
9 Matrix Factorization Y data U V = + Z i.i.d. p(z ij ) Additive Gaussian noise minimize Y-UV Fro General additive noise minimize -log p(y ij -UV ij )
10 Additive Gaussian noise minimize Y-UV Fro General additive noise minimize -log p(y ij -UV ij ) General conditional models minimize -log p(y ij UV ij ) Multiplicative noise Binary observations: Logistic LRA Exponential PCA [Collins+01] Multinomial (plsa [Hofman01]) Other loss functions [Gordon02] minimize loss(uv ;Y ) p(y ij UV ij ) Matrix Factorization Y data U V
11 topic 1..k T I The Aspect Model (plsa) J Y co-occurrence occurrence counts word X p(i,j) rank k = p(i,t) [Hoffman+99] p(j t) document Y X Multinomial(N,X) Y ij X ij Binomial(N,X ij ) N= Y ij
12 Low-Rank Models for Matrices of Counts, Occurrences or Frequencies Multinomial Independent Binomials Independent Bernoulli Mean parameterization 0 X ij 1 E[Y ij X ij ]=X ij Natural parameterization unconstrained X ij Aspect Model (plsa) [Hoffman+99] Sufficient Dimensionality Reduction [Globerson+02] Y ij X ij ~Bin(N,X ij ) NMF if X ij =1 NMF [Lee+01] Y ij X ij ~Bin(N,g(X ij )) P(Y ij =1) = X ij Exponential PCA: [Collins+01] p(y ij X ij ) exp(y ij X ij +F(Y ij )) Logistic Low Rank Approximation [Schein+03] hinge loss g(x)=1/(1+e x )
13 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)
14 Finding Low Rank Approximation Find rank-k X minimizing loss(x ij ;Y ij ) (=log p(y X)) Non-convex: X is rank-k is a non-convex constraint loss((uv) ij ;Y ij ) not convex in U,V rank-k X minimizing (Y ij -X ij ) 2 : non-convex, but no (non-global) local minima solution: leading components of SVD For other loss functions, or with missing data: cannot use SVD, local minima, difficult problem Weighted Low Rank Approximation: minimize W ij (Y ij -X ij ) 2 Arbitrarily specified weights (part of input)
15 WLRA: Optimization J ( UV ') = W ( Y UV ') ij ij 2 ij For fixed V, find optimal U For fixed U, find optimal V n k U m V k V J * ( V ) = minj ( UV ') U J* ( V ') = 2U*'(( U* V ' Y ) W ) Conjugate gradient descent on J* Optimize km parameters instead of k(n+m) EM approach: elementwise product X LowRankApprox(W Y+(1-W) X)
16 Newton Optimization for Non-Quadratic Loss Functions minimize ij loss(x ij ;Y ij ) loss(x ij ;Y ij ) convex in X ij Iteratively optimize quadratic approximations of objective Each such quadratic optimization is a weighted low rank approximation
17 Maximum Likelihood Estimation with Gaussian Mixture Noise Y observed = X + rank k Z noise C (hidden) Mixture of Gaussians: {p c,µ c,σ c } E step: calculate posteriors of C M step: WLRA with Pr( Cij = c ) Wij = 2 σ c C A Pr( C ij ij = Yij + 2 c σc = c ) µ C W ij
18 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)
19 Asymptotic Behavior of Low Rank Approximation parameters random Y X = + rank k Z independent Single observation, Number of parameters is linear in number of observables Can never approach correct estimation of parameters What can be estimated is row-space of X
20 Y U V = + random Z independent Number of parameters is linear in number of observables What can be estimated is row-space of X
21 random V random Y = + U Z independent What can be estimated is row-space of X
22 y = u + V z Multiple samples of random variable y What can be estimated is row-space of X
23 Probabilistic PCA y = u + ~ Gaussian V z ~ Gaussian x ~ Gaussian Σ X rank k estimated parameters σ 2 I Maximum likelihood PCA [Tipping Bishop 97]
24 Probabilistic PCA [Tipping Bishop 97] y = u + ~ Gaussian V estimated parameters z ~ Gaussian(σ 2 I) Latent Dirichlet Allocation [Blei Ng Jordan 03] y ~ V Multinomial( N, u ) ~ Dirichlet(α) estimated parameters
25 Generative and Non-Generative Low Rank Models Y=X+Gaussian Probabilistic PCA plsa, Y~Binom(N,X) Non-parametric generative models Latent Dirichlet Allocatoin Parametric generative models: Consistency of Maximum Likelihood estimation guaranteed if model assumptions hold (both on Y X and on U)
26 estimated parameters row-space of V, not entries of V y = u + P U x V z ~ Gaussian nonparametric, unconstrained nuisance Non-parametric model, estimation of a parametric part of the model Maximum Likelihood Single Observed Y
27 Consistency of ML Estimation estimated V true V
28 Consistency of ML Estimation V expected contribution of x to likelihood of V x [ max log p (( x + z ) )] Ψ( V ; x ) = E uv ML estimator is consistent for any P u z u Z for all x, V maximizes iff V spans x Ψ( V ; x ) When Z~Gaussian, Ψ ( V ; x ) = E[L 2 distance of x+z from V] For iid Gaussian noise, ML estimation (PCA) of the low-rank sub-space is consistent
29 Consistency of ML Estimation V expected contribution of x to likelihood of V x=0 [ max log p (( x + z ) )] Ψ( V ; x ) = E uv ML estimator is consistent for any P u z u Z for all x, V maximizes iff V spans x Ψ( V ; x ) ML estimator is consistent for any P u Ψ(V ;0) is constant for all V
30 Consistency of ML Estimation V expected contribution of x to likelihood of V x=0 θ [ max log p (( x + z ) )] Ψ( V ; x ) = E z u Z uv 1 z [ i ] Laplace: p ( z [ i ]) = e Z vertical axis Ψ(V ;0) = 1 horizontal axis π θ= 2 π
31 Consistency of ML Estimation expected contribution of x to likelihood of V X=UV General conditional model for Y X [ max log p ( Y uv ) x] Ψ( V; x) = E X u Y ML estimator is consistent for any P u Y X for all x, V maximizes iff V spans x Ψ( V ; x ) ML estimator is consistent for any P u Ψ(V ;0) is constant for all V
32 Consistency of ML Estimation V expected contribution of x to likelihood of V x=0 θ [ max log p ( Y uv ) x] Ψ( V; x) = E X u Y Logistic: py X Y i = x Y X i 1 = 1 e ( [ ] 1 [ ]) x[ i] Ψ(V ;0) = θ= π π 2
33 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) Span of k leading eigenvectors of Σˆ Y Σ X + σ 2 I rank k
34 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) s 1, s 2,, s k, 0, 0,, 0 Σˆ Y Σ Y = Σ X + Σ Z = Σ X + σ 2 I s 1 +σ 2, s 2 +σ 2,, s k +σ 2, σ 2, σ 2,, σ 2
35 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) distance to correct subspace ML PCA Noise: 0.99 N(0,1) N(0,100) Sample size (number of observed rows)
36 Consistent Estimation Additive i.i.d. noise Y=X+Z: Maximum Likelihood generally not consistent PCA is consistent (for any noise distribution) Unbiased noise E[Y X]=X: Maximum Likelihood generally not consistent PCA not consistent -- Can correct by ignoring diagonal of covariance Exponential PCA (X are natural parameters) Maximum Likelihood generally not consistent Covariance methods not consistent???
37 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)
38 Collaborative Filtering Based on preferences so far, and preferences of others: Predict further preferences users movies Implicit or explicit preferences? Type of queries.
39 Matrix Completion Based on partially observed matrix: Predict unobserved entries users movies ? ? 5 3? ? ? 5? ? 5 2? ? ? Will user i like movie j?
40 Matrix Completion with Matrix Factorization U V ? ? ? ? ? +1 5? ? ? ? ? Fit factorizable (lowrank) matrix X=UV to observed entries. minimize Σloss(X ij ;Y ij ) prediction observation Use matrix X to predict unobserved entries. [Sarwar+00] [Azar+01] [Hoffman04] [Marlin+04]
41 Matrix Completion with Matrix Factorization U v12 v9 v10 v11 v8 v7 v6 v5 v4 v3 v2 v When U is fixed, each row is a linear classification problem: rows of U are feature vectors columns of V are linear classifiers Fitting U and V: Learning features that work well across all classification problems.
42 Max-Margin Matrix Factorization low norm U V Instead of bounding dimensionality of U,V, bound norms of U,V For observed Y ij ±1: Y ij X ij Margin U i,v j
43 Max-Margin Matrix Factorization low norm U V bound norms on average: ( i U i 2 ) ( j V j 2 ) 1 bound norms uniformly: (max i U i 2 ) (max j V j 2 ) 1 For observed Y ij ±1: Y ij X ij Margin U i,v j U is fixed: each column of V is SVM
44 Max-Margin Matrix Factorization low norm U V bound norms on average: ( i U i 2 ) ( j V j 2 ) 1 bound norms uniformly: (max i U i 2 ) (max j V j 2 ) 1 For observed Y ij ±1: Y ij X ij Margin U i,v j U is fixed: each column of V is SVM
45 Geometric Interpretation V columns of V (items) rows of U (users) U (max i U i 2 ) (max j V j 2 ) 1
46 Geometric Interpretation V columns of V (items) rows of U (users) U (max i U i 2 ) (max j V j 2 ) 1
47 Finding Max-Margin Matrix Factorizations maximize M Y ij X ij M X=UV ( i U i 2 ) ( j V j 2 ) 1 maximize M Y ij X ij M X=UV (max i U i 2 ) (max j V j 2 ) 1 Unlike rank(x) k, these are convex constraints!
48 Finding Max-Margin Matrix Factorizations maximize M Y ij X ij M X=UV ( i U i 2 ) ( j V j 2 ) 1 X Y Q X tr = (singular values of X) minimize tr(a)+tr(b) + c ξ ij A X Y ij X ij 1 X B p.s.d. Dual variable Q ij for each observed (i,j) maximize Q ij - ξ ij 0 Q ij c Q Y 2 1 sparse elementwise product (zero for unobserved entries)
49 Finding Max-Margin Matrix Factorizations Semi-definite program with sparse dual: Limited by number of observations, not size (for both average-norm and max-norm) Current implementation: use CSDP (off-the-shelf solver), up to 30k observations (e.g. 1000x1000, 3% observed) For large-scale problems: updates on dual alone? Dual variable Q ij for each observed (i,j) minimize tr(a)+tr(b) + c ξ ij maximize Q ij Y ij X ij 1 - ξ ij 0 Q ij c A X X B p.s.d. Q Y 2 1 sparse elementwise product (zero for unobserved entries)
50 Loss Functions for Rankings loss(x ij ;Y ij ) 0 Y ij =+1 X ij
51 Loss Functions for Rankings loss(x ij ;Y ij ) Y ij = θ 1 θ 2 θ 3 θ 4 θ 5 θ 6 X ij
52 Loss Functions for Rankings all-threshold loss loss(x ij ;Y ij ) Immediate-threshold loss [Shashua Levin 03] Y ij =3 θ 1 θ 2 θ 3 θ 4 θ 5 θ 6 X ij All-threshold loss is a bound on the absolute rank-difference For both loss functions: learn per-user θ s
53 Experimental Results on MovieLens Subset all threshold MMMF immediate threshold MMMF K-medians K=2 Rank-1 Rank-2 Rank Difference Zero One Error users 100 movies subset of MovieLens, 3515 training ratings, 3515 test ratings
54 Outline Finding Low Rank Approximations Weight Low Rank Approx: minimize ij W ij (Y ij -X ij ) 2 Use WLRA Basis for other losses / conditional models Consistency of Low Rank Approximation When more data is available, do we converge to correct solution? Not always Matrix Factorization for Collaborative Filtering Maximum Margin Matrix Factorization Generalization Error Bounds (Low Rank & MMMF)
55 Generalization Error Bounds D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error Assuming a low-rank structure (eigengap): Asymptotic behavior [Azar+01] Sample complexity, query strategy [Drineas+02] X Y unknown, assumption-free
56 Generalization Error Bounds D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error D S (X;Y) = ij S loss(x ij ;Y ij )/ S empirical error Y Pr S ( rank-k X D(X;Y)<D S (X;Y)+ε ) > 1-δ hypothesis X Y S source distribution training set random unknown, assumption-free
57 Generalization Error Bounds 0/1 loss: loss(x ij ;Y ij ) = 1 when sign(x ij ) Y ij D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error D S (X,Y) = ij S loss(x ij ;Y ij )/ S empirical error For particular X,Y: random loss(x ij ;Y ij ) Bernoulli(D(X;Y)) Pr( D S (X;Y) < D(X;Y)-ε ) < e -2 S ε2 random Union bound over all possible Xs: Y Pr S ( X D(X,Y)<D S (X,Y)+ε ) > 1-δ ε = 8em klog(# ( n + of mpossible )log Xs) + log 1 2 S k δ
58 Number of Sign Configurations of Rank-k Matrices U V X nk variables mk variables X i,j = r U i,r V r,j nm polynomials of degree 2 polynomials Warren (1968): The number of connected components of { x i P i (x) 0 } is at most 4 e (degree) (#polys) (# variables) Based on [Alon95] (# variables) P 2 (x 1,x 2 )=0 P 1 (x 1,x 2 )= P 3 (x 1,x 2 )=0
59 Number of Sign Configurations of Rank-k Matrices U V X nk variables mk variables X i,j = r U i,r V r,j nm polynomials of degree 2 polynomials Warren (1968): The number of connected components of { x i P i (x) 0 } is at most Based on [Alon95] 4 e (degree) 2 (#polys) nm (# k(n+m) variables) k(n+m) (# variables) P 2 (x 1,x 2 )=0 P 1 (x 1,x 2 )= P 3 (x 1,x 2 )=0
60 Generalization Error Bounds: Low Rank Matrix Factorization D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error D S (X,Y) = ij S loss(x ij ;Y ij )/ S empirical error Y Pr S ( X of rank-k D(X,Y)<D S (X,Y)+ε ) > 1-δ 0/1 loss: loss(x ij ;Y ij ) = sign(x ij Y ij ) ε = 8em k( n + m)log + log 1 2 S k δ loss(x ij ;Y ij ) 1: (by bounding the psudodimension) ε = 6 k( n + m)log 8em k log S S k ( n+ m) + log 1 δ
61 Generalization Error Bounds: Large Margin Matrix Factorization D(X;Y) = ij loss(x ij ;Y ij )/nm generalization error loss(x ij ;Y ij ) = sign(x ij Y ij ) D S (X;Y) = ij S loss 1 (X ij ;Y ij )/ S empirical error loss 1 (X ij ;Y ij ) = sign(x ij Y ij -1) Y Pr S ( X D(X,Y)<D S (X,Y)+ε ) > 1-δ universal constant from [Seginer00] bound on spectral norm of random matrix ( U i 2 /n)( V i 2 /m) R 2 : ε = K 4 ln m R 2 ( n + m)log n S + log 1 δ (max U i 2 )(max V j 2 ) R 2 : ε = 12 R 2 ( n + m) + S log 1 δ
62 Maximum Margin Matrix Factorization as a Convex Combination of Classifiers { UV ( U i 2 )( V i 2 ) 1 } = convex-hull( { uv u R n, v R m u = v =1} ) conv( { uv u ±1 n, v ±1m} ) { UV (max U i 2 )(max V j 2 ) 1 } 2 conv( { uv u ±1 n, v ±1m} ) Grothendiek's Inequality
63 Summary Finding Low Rank Approximations Weighted Low Rank Approximations Basis for other loss function: Newton, Gaussian mixtures Consistency of Low Rank Approximation ML for popular low-rank models is not consistent! PCA consistent for additive noise; diagonal ignoring for unbiased Efficient estimators? Consistent estimators for Exponential-PCA? Maximum Margin Matrix Factorization Correspondence with large margin linear classification Sparse SDPs for both avarage-norm and max-norm formulations Direct optimization of dual would enable large-scale applications Generalization Error Bounds for Collaborative Prediction First assumption free bounds for matrix completion Both for Low-Rank and for Max-Margin Observation process?
64 Average February Temperature (centigrade) Tel-Aviv Haifa (Technion) Boston Toronto????
Maximum Margin Matrix Factorization
Maximum Margin Matrix Factorization Nati Srebro Toyota Technological Institute Chicago Joint work with Noga Alon Tel-Aviv Yonatan Amit Hebrew U Alex d Aspremont Princeton Michael Fink Hebrew U Tommi Jaakkola
More informationRank, Trace-Norm & Max-Norm
Rank, Trace-Norm & Max-Norm as measures of matrix complexity Nati Srebro University of Toronto Adi Shraibman Hebrew University Matrix Learning users movies 2 1 4 5 5 4? 1 3 3 5 2 4? 5 3? 4 1 3 5 2 1? 4
More informationWeighted Low Rank Approximations
Weighted Low Rank Approximations Nathan Srebro and Tommi Jaakkola Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Weighted Low Rank Approximations What is
More informationMatrix Factorizations: A Tale of Two Norms
Matrix Factorizations: A Tale of Two Norms Nati Srebro Toyota Technological Institute Chicago Maximum Margin Matrix Factorization S, Jason Rennie, Tommi Jaakkola (MIT), NIPS 2004 Rank, Trace-Norm and Max-Norm
More informationLinear Dependent Dimensionality Reduction
Linear Dependent Dimensionality Reduction Nathan Srebro Tommi Jaakkola Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Cambridge, MA 239 nati@mit.edu,tommi@ai.mit.edu
More informationLearning with Matrix Factorizations. Nathan Srebro
Learning with Matrix Factorizations by Nathan Srebro Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy
More informationPROBABILISTIC LATENT SEMANTIC ANALYSIS
PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications
More informationNathan Srebro. at the. August :...=... Department of Electrical Engineering and Computer Science August 16, 2004
Learning with Matrix Factorizations by Nathan Srebro Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationGeneralization Error Bounds for Collaborative Prediction with Low-Rank Matrices
Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices Nathan Srebro Department of Computer Science University of Toronto Toronto, ON, Canada nati@cs.toronto.edu Noga Alon School
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationCollaborative Filtering: A Machine Learning Perspective
Collaborative Filtering: A Machine Learning Perspective Chapter 6: Dimensionality Reduction Benjamin Marlin Presenter: Chaitanya Desai Collaborative Filtering: A Machine Learning Perspective p.1/18 Topics
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationLoss Functions for Preference Levels: Regression with Discrete Ordered Labels
Loss Functions for Preference Levels: Regression with Discrete Ordered Labels Jason D. M. Rennie Massachusetts Institute of Technology Comp. Sci. and Artificial Intelligence Laboratory Cambridge, MA 9,
More informationIntroduction to Machine Learning. Introduction to ML - TAU 2016/7 1
Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger
More informationInformation retrieval LSI, plsi and LDA. Jian-Yun Nie
Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: http://en.wikipedia.org/wiki/eigenvector For a square matrix A: Ax = λx where x is a vector (eigenvector), and
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s
More informationA Simple Algorithm for Nuclear Norm Regularized Problems
A Simple Algorithm for Nuclear Norm Regularized Problems ICML 00 Martin Jaggi, Marek Sulovský ETH Zurich Matrix Factorizations for recommender systems Y = Customer Movie UV T = u () The Netflix challenge:
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:
More informationManifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA
Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/
More informationDocument and Topic Models: plsa and LDA
Document and Topic Models: plsa and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline Topic Models plsa LSA Model Fitting via EM phits: link analysis
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationLearning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationAndriy Mnih and Ruslan Salakhutdinov
MATRIX FACTORIZATION METHODS FOR COLLABORATIVE FILTERING Andriy Mnih and Ruslan Salakhutdinov University of Toronto, Machine Learning Group 1 What is collaborative filtering? The goal of collaborative
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationNonnegative Matrix Factorization
Nonnegative Matrix Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationDimension Reduction. David M. Blei. April 23, 2012
Dimension Reduction David M. Blei April 23, 2012 1 Basic idea Goal: Compute a reduced representation of data from p -dimensional to q-dimensional, where q < p. x 1,...,x p z 1,...,z q (1) We want to do
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationKernel methods, kernel SVM and ridge regression
Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;
More informationMACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA
1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms Adrien Todeschini Inria Bordeaux JdS 2014, Rennes Aug. 2014 Joint work with François Caron (Univ. Oxford), Marie
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationGenerative Clustering, Topic Modeling, & Bayesian Inference
Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More informationConvex Optimization M2
Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationMatrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang
Matrix Factorization & Latent Semantic Analysis Review Yize Li, Lanbo Zhang Overview SVD in Latent Semantic Indexing Non-negative Matrix Factorization Probabilistic Latent Semantic Indexing Vector Space
More informationcross-language retrieval (by concatenate features of different language in X and find co-shared U). TOEFL/GRE synonym in the same way.
10-708: Probabilistic Graphical Models, Spring 2015 22 : Optimization and GMs (aka. LDA, Sparse Coding, Matrix Factorization, and All That ) Lecturer: Yaoliang Yu Scribes: Yu-Xiang Wang, Su Zhou 1 Introduction
More informationhe Applications of Tensor Factorization in Inference, Clustering, Graph Theory, Coding and Visual Representation
he Applications of Tensor Factorization in Inference, Clustering, Graph Theory, Coding and Visual Representation Amnon Shashua School of Computer Science & Eng. The Hebrew University Matrix Factorization
More informationFinal Exam, Machine Learning, Spring 2009
Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationData Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
More informationBinary Principal Component Analysis in the Netflix Collaborative Filtering Task
Binary Principal Component Analysis in the Netflix Collaborative Filtering Task László Kozma, Alexander Ilin, Tapani Raiko first.last@tkk.fi Helsinki University of Technology Adaptive Informatics Research
More informationUsing SVD to Recommend Movies
Michael Percy University of California, Santa Cruz Last update: December 12, 2009 Last update: December 12, 2009 1 / Outline 1 Introduction 2 Singular Value Decomposition 3 Experiments 4 Conclusion Last
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationExpectation Maximization
Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same
More informationAdaptive one-bit matrix completion
Adaptive one-bit matrix completion Joseph Salmon Télécom Paristech, Institut Mines-Télécom Joint work with Jean Lafond (Télécom Paristech) Olga Klopp (Crest / MODAL X, Université Paris Ouest) Éric Moulines
More informationCPSC 340 Assignment 4 (due November 17 ATE)
CPSC 340 Assignment 4 due November 7 ATE) Multi-Class Logistic The function example multiclass loads a multi-class classification datasetwith y i {,, 3, 4, 5} and fits a one-vs-all classification model
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More informationGeneralized Principal Component Analysis: Projection of Saturated Model Parameters
Generalized Principal Component Analysis: Projection of Saturated Model Parameters Andrew J. Landgraf and Yoonkyung Lee Abstract Principal component analysis (PCA) is very useful for a wide variety of
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationLecture 7: Con3nuous Latent Variable Models
CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationLatent Variable View of EM. Sargur Srihari
Latent Variable View of EM Sargur srihari@cedar.buffalo.edu 1 Examples of latent variables 1. Mixture Model Joint distribution is p(x,z) We don t have values for z 2. Hidden Markov Model A single time
More informationCPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018
CPSC 340: Machine Learning and Data Mining Sparse Matrix Factorization Fall 2018 Last Time: PCA with Orthogonal/Sequential Basis When k = 1, PCA has a scaling problem. When k > 1, have scaling, rotation,
More informationapproximation algorithms I
SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions
More informationModeling User Rating Profiles For Collaborative Filtering
Modeling User Rating Profiles For Collaborative Filtering Benjamin Marlin Department of Computer Science University of Toronto Toronto, ON, M5S 3H5, CANADA marlin@cs.toronto.edu Abstract In this paper
More informationScaling Neighbourhood Methods
Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)
More informationGenerative Models for Discrete Data
Generative Models for Discrete Data ddebarr@uw.edu 2016-04-21 Agenda Bayesian Concept Learning Beta-Binomial Model Dirichlet-Multinomial Model Naïve Bayes Classifiers Bayesian Concept Learning Numbers
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationGeneralized Linear Models in Collaborative Filtering
Hao Wu CME 323, Spring 2016 WUHAO@STANFORD.EDU Abstract This study presents a distributed implementation of the collaborative filtering method based on generalized linear models. The algorithm is based
More informationStudy Notes on the Latent Dirichlet Allocation
Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection
More informationLarge-scale Collaborative Prediction Using a Nonparametric Random Effects Model
Large-scale Collaborative Prediction Using a Nonparametric Random Effects Model Kai Yu Joint work with John Lafferty and Shenghuo Zhu NEC Laboratories America, Carnegie Mellon University First Prev Page
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David
More informationCollaborative topic models: motivations cont
Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.
More informationBehavioral Data Mining. Lecture 7 Linear and Logistic Regression
Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression
More informationParametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory
Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis
ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis Lecture 7: Matrix completion Yuejie Chi The Ohio State University Page 1 Reference Guaranteed Minimum-Rank Solutions of Linear
More informationBut if z is conditioned on, we need to model it:
Partially Unobserved Variables Lecture 8: Unsupervised Learning & EM Algorithm Sam Roweis October 28, 2003 Certain variables q in our models may be unobserved, either at training time or at test time or
More informationThe Variational Gaussian Approximation Revisited
The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much
More information