Learning with Matrix Factorizations

1 Learning with Matrix Factorizations Nati Srebro Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Adviser: Tommi Jaakkola (MIT) Committee: Alan Willsky (MIT), Tali Tishby (HUJI), Josh Tenenbaum (MIT) Joint work with Shie Mannor (McGill), Noga Alon (TAU) and Jason Rennie (MIT) Lots of help from John Barnett, Karen Livescu, Adi Shraibman, Michael Collins, Erik Demaine, Michel Goemans, David Karger and Dmitry Panchenko

2 Dimensionality Reduction: low-dimensional representation capturing important aspects of high-dimensional data. Observations y: an image, gene expression levels, a user's preferences, a document. Small representation u: varying aspects of a scene, description of a biological process, characterization of a user, topics. Uses: compression (mostly to reduce processing time); reconstructing a latent signal (biological processes through gene expression); capturing structure in a corpus (documents, images, etc.); prediction (collaborative filtering).

3 Linear Dimensionality Reduction: y ≈ u1·v1 + u2·v2 + u3·v3

4 Linear Dimensionality Reduction (over movie titles): y ≈ u1·v1 + u2·v2 + u3·v3, e.g. with u = (+1, +2, −1) weighting comic value, dramatic value and violence. Here y is the preferences of a specific user (a real-valued preference level for each title) and u gives the characteristics of that user.

5 Linear Dimensionality Reduction: y ≈ u·V, where y and the rows of V are indexed by titles.

6 Linear Dimensionality Reduction: stacking the user vectors, Y ≈ U·V, where Y is the users × titles data matrix and each row satisfies y ≈ u·V.

7 Matrix Factorization: Y (data) ≈ U·V = X of rank k. Possible constraints on U, V: non-negativity [LeeSeung99]; stochasticity (convexity) [LeeSeung97][Barnett04]; sparsity, with clustering as an extreme case (when the rows of U are sparse). Unconstrained: low-rank approximation.

8 Matrix Factorization: Y (data) = U·V + Z, with Z i.i.d. Gaussian. Additive Gaussian noise ⇒ minimize ||Y − UV'||_Fro, since log L(UV'; Y) = Σ_ij log P(Y_ij | (UV')_ij) = −(1/2σ²) Σ_ij (Y_ij − (UV')_ij)² + const.
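
A minimal illustration (not from the slides) of this unconstrained Gaussian case: the maximum-likelihood rank-k fit is the truncated SVD (Eckart–Young). Function and variable names here are my own.

```python
import numpy as np

def rank_k_approx(Y, k):
    """Best rank-k approximation of Y in Frobenius norm: truncated SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 50))   # rank-3 signal
Y = X + 0.1 * rng.standard_normal(X.shape)                         # i.i.d. Gaussian noise
X_hat = rank_k_approx(Y, 3)
print(np.linalg.norm(X - X_hat, "fro") / np.linalg.norm(X, "fro")) # small relative error
```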

9 Matrix Factorization: Y (data) = U·V + Z, with Z_ij i.i.d. with density p(z_ij). Additive Gaussian noise ⇒ minimize ||Y − UV'||_Fro. General additive noise ⇒ minimize Σ_ij −log p(Y_ij − (UV')_ij).

10 Matrix Factorization: Y (data) ≈ U·V. Additive Gaussian noise ⇒ minimize ||Y − UV'||_Fro. General additive noise ⇒ minimize Σ_ij −log p(Y_ij − (UV')_ij). General conditional models ⇒ minimize Σ_ij −log p(Y_ij | (UV')_ij): multiplicative noise; binary observations (Logistic LRA); Exponential PCA [Collins+01]; multinomial (pLSA [Hofmann01]); other loss functions [Gordon02] ⇒ minimize loss(UV'; Y).
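
As a concrete instance of a conditional model, here is a rough sketch (my own, not the thesis implementation) of fitting a logistic low-rank approximation to a fully observed ±1 matrix by plain gradient descent.

```python
import numpy as np

def logistic_lra(Y, k, steps=500, lr=0.05, seed=0):
    """Fit X = U V' minimizing sum_ij -log p(Y_ij | X_ij) under a logistic link,
    for Y_ij in {+1, -1} (all entries observed)."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(steps):
        X = U @ V.T
        G = -Y / (1.0 + np.exp(Y * X))   # d/dX of -log sigmoid(Y_ij * X_ij)
        U, V = U - lr * (G @ V), V - lr * (G.T @ U)
    return U, V
```

With missing entries the sum, and hence G, would simply be restricted to the observed positions.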

11 The Aspect Model (pLSA) [Hofmann+99]: Y is an I × J matrix of word–document co-occurrence counts, and X is the rank-k matrix of probabilities p(i,j) = Σ_t p(i,t) p(j|t) over topics t = 1..k. Then Y | X ~ Multinomial(N, X), so Y_ij | X_ij ~ Binomial(N, X_ij) with N = Σ_ij Y_ij.

12 Low-Rank Models for Matrices of Counts, Occurrences or Frequencies. Mean parameterization (0 ≤ X_ij ≤ 1, E[Y_ij | X_ij] = X_ij): Multinomial — Aspect Model (pLSA) [Hofmann+99]; Independent Binomials — Y_ij | X_ij ~ Bin(N, X_ij), NMF [Lee+01]; Independent Bernoulli — P(Y_ij = 1) = X_ij. Natural parameterization (unconstrained X_ij): Multinomial — Sufficient Dimensionality Reduction [Globerson+02]; Independent Binomials — Y_ij | X_ij ~ Bin(N, g(X_ij)); Independent Bernoulli — Exponential PCA [Collins+01] with p(Y_ij | X_ij) ∝ exp(Y_ij X_ij + F(Y_ij)), Logistic Low Rank Approximation [Schein+03] with g(x) = 1/(1+e^−x), or a hinge loss.

13 Outline. Finding Low Rank Approximations: Weighted Low Rank Approx (minimize Σ_ij W_ij (Y_ij − X_ij)²); use WLRA as a basis for other losses / conditional models. Consistency of Low Rank Approximation: when more data is available, do we converge to the correct solution? Not always. Matrix Factorization for Collaborative Filtering: Maximum Margin Matrix Factorization; Generalization Error Bounds (Low Rank & MMMF).

14 Finding Low Rank Approximations. Find rank-k X minimizing Σ_ij loss(X_ij; Y_ij) (= −log p(Y | X)). Non-convex: "X is rank-k" is a non-convex constraint, and loss((UV')_ij; Y_ij) is not convex in U, V. Rank-k X minimizing Σ_ij (Y_ij − X_ij)²: non-convex, but no non-global local minima; solution: leading components of the SVD. For other loss functions, or with missing data: cannot use the SVD, local minima, difficult problem. Weighted Low Rank Approximation: minimize Σ_ij W_ij (Y_ij − X_ij)², with arbitrarily specified weights (part of the input).

15 WLRA: Optimization. J(UV') = Σ_ij W_ij (Y_ij − (UV')_ij)². Alternate: for fixed V (m×k), find the optimal U (n×k); for fixed U, find the optimal V. Define J*(V) = min_U J(UV'); then ∇_{V'} J*(V) = 2 U*'((U* V' − Y) ⊗ W), where U* is the optimal U for this V and ⊗ is the elementwise product. Conjugate gradient descent on J*: optimize km parameters instead of k(n+m). EM approach: X ← LowRankApprox(W ⊗ Y + (1 − W) ⊗ X).
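
A minimal sketch of the EM-style update from this slide, assuming weights W_ij ∈ [0, 1]; the helper name wlra_em is mine.

```python
import numpy as np

def wlra_em(Y, W, k, iters=100):
    """Weighted low-rank approximation: minimize sum_ij W_ij (Y_ij - X_ij)^2
    over rank-k X, using the update X <- SVD_k(W*Y + (1-W)*X) with W in [0, 1]."""
    X = np.zeros_like(Y)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(W * Y + (1.0 - W) * X, full_matrices=False)
        X = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return X
```

With missing data this reduces to the familiar impute-and-refactor loop: set W_ij = 1 for observed entries and 0 for missing ones.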

16 Newton Optimization for Non-Quadratic Loss Functions: minimize Σ_ij loss(X_ij; Y_ij), with loss(X_ij; Y_ij) convex in X_ij. Iteratively optimize quadratic approximations of the objective; each such quadratic optimization is a weighted low rank approximation.
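
One way to realize this (a sketch under my own naming, reusing the wlra_em helper above): a single Newton step replaces the loss by its second-order expansion, which is exactly a WLRA with curvature weights and shifted targets.

```python
import numpy as np

def newton_wlra_step(X, Y, loss_grad, loss_hess, k):
    """One quadratic-approximation (Newton) step for minimizing
    sum_ij loss(X_ij; Y_ij) over rank-k X.  loss_grad / loss_hess return the
    elementwise first and second derivatives with respect to X_ij."""
    H = loss_hess(X, Y)                 # per-entry curvature, assumed positive
    A = X - loss_grad(X, Y) / H         # minimizer of each 1-D quadratic
    return wlra_em(A, H / H.max(), k)   # rescale weights into [0, 1]

# example: logistic loss -log sigmoid(Y_ij * X_ij) for Y_ij = +/-1
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
loss_grad = lambda X, Y: -Y * sig(-Y * X)
loss_hess = lambda X, Y: sig(Y * X) * sig(-Y * X)
```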

17 Maximum Likelihood Estimation with Gaussian Mixture Noise. Y (observed) = X (rank k) + Z (noise), where each Z_ij is drawn from a mixture of Gaussians with hidden component C_ij and parameters {p_c, μ_c, σ_c}. E step: calculate the posteriors of C. M step: WLRA with weights W_ij = Σ_c Pr(C_ij = c)/σ_c² and targets A_ij = Y_ij − (1/W_ij) Σ_c Pr(C_ij = c) μ_c/σ_c².
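
A compact sketch of the E-/M-step above (again reusing the wlra_em helper; the exact target expression follows my reconstruction of this slide, so treat it as indicative).

```python
import numpy as np

def lra_gaussian_mixture(Y, k, p, mu, sigma, iters=50):
    """ML low-rank estimation of X where Y = X + Z and Z_ij is drawn from a
    mixture of Gaussians with weights p, means mu and std devs sigma."""
    X = np.zeros_like(Y)
    for _ in range(iters):
        # E step: posterior over the mixture component of each entry
        R = (Y - X)[None, :, :] - mu[:, None, None]
        logq = np.log(p)[:, None, None] - np.log(sigma)[:, None, None] \
               - 0.5 * (R / sigma[:, None, None]) ** 2
        q = np.exp(logq - logq.max(axis=0))
        q /= q.sum(axis=0)
        # M step: a weighted low-rank approximation
        inv_var = 1.0 / sigma[:, None, None] ** 2
        W = (q * inv_var).sum(axis=0)
        A = Y - (q * inv_var * mu[:, None, None]).sum(axis=0) / W
        X = wlra_em(A, W / W.max(), k)
    return X
```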

18 Outline. Finding Low Rank Approximations: Weighted Low Rank Approx (minimize Σ_ij W_ij (Y_ij − X_ij)²); use WLRA as a basis for other losses / conditional models. Consistency of Low Rank Approximation: when more data is available, do we converge to the correct solution? Not always. Matrix Factorization for Collaborative Filtering: Maximum Margin Matrix Factorization; Generalization Error Bounds (Low Rank & MMMF).

19 Asymptotic Behavior of Low Rank Approximation. Y = X + Z, where X (rank k) holds the parameters and Z is independent random noise, and there is a single observation of Y. The number of parameters is linear in the number of observables, so we can never approach correct estimation of the parameters themselves; what can be estimated is the row-space of X.

20 Y = U·V + Z, with Z independent random noise. The number of parameters is linear in the number of observables; what can be estimated is the row-space of X = U·V.

21 Y = U·V + Z, with U random and Z independent random noise. What can be estimated is the row-space of X, i.e. the space spanned by the rows of V.

22 Row-wise view: y = u·V + z, with multiple samples of the random variable y. What can be estimated is the row-space of X, i.e. the span of V.

23 Probabilistic PCA [Tipping Bishop 97]: y = u·V + z, with u ~ Gaussian and z ~ Gaussian(0, σ²I), so x = u·V ~ Gaussian with rank-k covariance Σ_X. The estimated parameters are V (equivalently Σ_X) and σ². Maximum likelihood ⇒ PCA.

24 Probabilistic PCA [Tipping Bishop 97]: y = u·V + z, with u ~ Gaussian and z ~ Gaussian(σ²I); V (and σ²) are the estimated parameters. Latent Dirichlet Allocation [Blei Ng Jordan 03]: y ~ Multinomial(N, u·V), with u ~ Dirichlet(α); V holds the estimated parameters.

25 Generative and Non-Generative Low Rank Models. Non-parametric generative models: Y = X + Gaussian noise; pLSA, Y ~ Binom(N, X). Parametric generative models: Probabilistic PCA; Latent Dirichlet Allocation. For parametric generative models, consistency of maximum-likelihood estimation is guaranteed if the model assumptions hold (both on Y | X and on U).

26 Non-parametric model, estimation of a parametric part of the model: y = u·V + z, with z ~ Gaussian and u drawn from an unknown distribution P_u (a non-parametric, unconstrained nuisance), x = u·V. The estimated parameter is the row-space of V, not the entries of V. Maximum likelihood over a single observed Y.

27 Consistency of ML Estimation. (Figure: the estimated V compared with the true V.)

28 Consistency of ML Estimation. Define the expected contribution of x to the likelihood of V: Ψ(V; x) = E_z[ max_u log p_Z((x + z) − u·V) ]. The ML estimator is consistent for any P_u iff, for all x, V maximizes Ψ(V; x) iff V spans x. When Z ~ Gaussian, Ψ(V; x) is (up to constants) −E[squared L2 distance of x + z from V], so for i.i.d. Gaussian noise, ML estimation (PCA) of the low-rank subspace is consistent.

29 Consistency of ML Estimation. Expected contribution of x = 0 to the likelihood of V: Ψ(V; 0) = E_z[ max_u log p_Z(z − u·V) ]. If the ML estimator is consistent for any P_u (for all x, V maximizes Ψ(V; x) iff V spans x), then in particular Ψ(V; 0) must be constant over all V.

30 Consistency of ML Estimation: Laplace noise, p(z[i]) = (1/Z) e^{−|z[i]|}. (Figure: Ψ(V; 0) plotted against the angle θ ∈ [0, π] of the one-dimensional subspace V — it is not constant, so ML estimation is not consistent.)
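
A small Monte Carlo check of this example (my own construction): estimate Ψ(V; 0) for 2-D i.i.d. Laplace noise and a 1-D subspace at angle θ by a grid search over u. The values vary with θ, so Ψ(V; 0) is indeed not constant.

```python
import numpy as np

def psi_at_zero(theta, n_samples=2000, seed=0):
    """Monte Carlo estimate of Psi(V; 0) = E_z[ max_u log p_Z(z - u v) ] for
    2-D i.i.d. Laplace noise and V = span(v), v = (cos theta, sin theta).
    Normalization constants of the Laplace density are dropped."""
    rng = np.random.default_rng(seed)
    v = np.array([np.cos(theta), np.sin(theta)])
    z = rng.laplace(size=(n_samples, 2))
    us = np.linspace(-6.0, 6.0, 601)                 # grid search over the scalar u
    resid = z[:, None, :] - us[None, :, None] * v    # (sample, grid point, coordinate)
    loglik = -np.abs(resid).sum(axis=2)              # log-density up to constants
    return loglik.max(axis=1).mean()

for theta in (0.0, np.pi / 4, np.pi / 2):
    print(f"theta = {theta:.2f}  Psi(V;0) ~ {psi_at_zero(theta):.3f}")
```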

31 Consistency of ML Estimation, general conditional model for Y | X, with X = U·V: Ψ(V; x) = E_{Y|x}[ max_u log p_{Y|X}(Y | u·V) ]. The ML estimator is consistent for any P_u iff, for all x, V maximizes Ψ(V; x) iff V spans x; in particular, Ψ(V; 0) must be constant over all V.

32 Consistency of ML Estimation: logistic conditional model, p(Y[i] = 1 | x[i]) = 1/(1 + e^{−x[i]}). (Figure: Ψ(V; 0) plotted against the angle θ ∈ [0, π] of the subspace V — again not constant, so ML is not consistent.)

33 Consistent Estimation. Additive i.i.d. noise, Y = X + Z: maximum likelihood is generally not consistent, but PCA is consistent (for any noise distribution): Σ̂_Y → Σ_X + σ²I, and the span of the k leading eigenvectors of Σ̂_Y converges to the row-space of X (rank k).

34 Consistent Estimation. Additive i.i.d. noise, Y = X + Z: maximum likelihood is generally not consistent, but PCA is consistent (for any noise distribution). Σ̂_Y → Σ_Y = Σ_X + Σ_Z = Σ_X + σ²I: if Σ_X has eigenvalues s_1, s_2, ..., s_k, 0, 0, ..., 0, then Σ_Y has eigenvalues s_1+σ², s_2+σ², ..., s_k+σ², σ², σ², ..., σ², with the same eigenvectors.
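
A quick numerical check of this eigenvalue picture (names are mine): adding σ²I shifts every eigenvalue by σ² and leaves the leading-k eigenspace unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, sigma2 = 3, 8, 0.5
B = rng.standard_normal((d, k))
Sigma_X = B @ B.T                        # rank-k signal covariance
Sigma_Y = Sigma_X + sigma2 * np.eye(d)   # covariance of Y = X + white noise

wX, VX = np.linalg.eigh(Sigma_X)
wY, VY = np.linalg.eigh(Sigma_Y)
print(np.allclose(wY, wX + sigma2))      # eigenvalues: s_i + sigma^2 and sigma^2
PX = VX[:, -k:] @ VX[:, -k:].T           # projector onto the leading-k eigenspace
PY = VY[:, -k:] @ VY[:, -k:].T
print(np.allclose(PX, PY))               # same leading subspace
```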

35 Consistent Estimation. Additive i.i.d. noise, Y = X + Z: maximum likelihood is generally not consistent, but PCA is consistent (for any noise distribution). (Figure: distance to the correct subspace vs. sample size (number of observed rows) for ML and PCA, with noise drawn from a mixture of N(0,1) (weight 0.99) and N(0,100).)

36 Consistent Estimation. Additive i.i.d. noise, Y = X + Z: maximum likelihood generally not consistent; PCA is consistent (for any noise distribution). Unbiased noise, E[Y | X] = X: maximum likelihood generally not consistent; PCA not consistent, but can be corrected by ignoring the diagonal of the covariance. Exponential PCA (X are natural parameters): maximum likelihood generally not consistent; covariance methods not consistent???

37 Outline. Finding Low Rank Approximations: Weighted Low Rank Approx (minimize Σ_ij W_ij (Y_ij − X_ij)²); use WLRA as a basis for other losses / conditional models. Consistency of Low Rank Approximation: when more data is available, do we converge to the correct solution? Not always. Matrix Factorization for Collaborative Filtering: Maximum Margin Matrix Factorization; Generalization Error Bounds (Low Rank & MMMF).

38 Collaborative Filtering: based on a user's preferences so far, and the preferences of others, predict further preferences (a users × movies matrix). Implicit or explicit preferences? Type of queries?

39 Matrix Completion: based on a partially observed users × movies matrix (a few rated entries, most marked "?"), predict the unobserved entries — will user i like movie j?

40 Matrix Completion with Matrix Factorization. Fit a factorizable (low-rank) matrix X = U·V to the observed entries: minimize Σ loss(X_ij; Y_ij) (prediction vs. observation) over the observed (i,j). Then use X to predict the unobserved entries. [Sarwar+00] [Azar+01] [Hofmann04] [Marlin+04]

41 Matrix Completion with Matrix Factorization. When U is fixed, predicting each column is a linear classification problem: the rows of U are feature vectors and the columns of V are linear classifiers. Fitting U and V jointly means learning features that work well across all of the classification problems.
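
A rough sketch of this view with U held fixed (my own, using a logistic loss in place of an SVM hinge): each item's column of V is fit as an independent linear classifier over that item's observed ±1 ratings.

```python
import numpy as np

def fit_V_given_U(U, Y, observed, steps=200, lr=0.1):
    """With U fixed, fit each column of V as a linear classifier (logistic loss,
    gradient descent) over the observed +/-1 entries of that column.
    `observed` is a boolean mask with the same shape as Y."""
    n, k = U.shape
    m = Y.shape[1]
    V = np.zeros((m, k))
    for j in range(m):
        rows = observed[:, j]
        A, y = U[rows], Y[rows, j]        # feature vectors and labels for item j
        w = np.zeros(k)
        for _ in range(steps):
            margins = y * (A @ w)
            g = -(A * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)
            w -= lr * g / max(len(y), 1)
        V[j] = w
    return V
```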

42 Max-Margin Matrix Factorization: X = U·V with low-norm U, V. Instead of bounding the dimensionality of U, V, bound the norms of U, V. For observed Y_ij = ±1: require Y_ij X_ij ≥ Margin, where X_ij = ⟨U_i, V_j⟩.

43 Max-Margin Matrix Factorization: X = U·V with low-norm U, V. Bound the norms on average: (Σ_i ||U_i||²)(Σ_j ||V_j||²) ≤ 1; or bound the norms uniformly: (max_i ||U_i||²)(max_j ||V_j||²) ≤ 1. For observed Y_ij = ±1: require Y_ij X_ij ≥ Margin, X_ij = ⟨U_i, V_j⟩. When U is fixed, each column of V is an SVM.

45 Geometric Interpretation: the columns of V (items) and the rows of U (users) live in the same k-dimensional space, with (max_i ||U_i||²)(max_j ||V_j||²) ≤ 1.

47 Finding Max-Margin Matrix Factorizations. Average-norm version: maximize M subject to Y_ij X_ij ≥ M, X = U·V, (Σ_i ||U_i||²)(Σ_j ||V_j||²) ≤ 1. Max-norm version: maximize M subject to Y_ij X_ij ≥ M, X = U·V, (max_i ||U_i||²)(max_j ||V_j||²) ≤ 1. Unlike rank(X) ≤ k, these are convex constraints!

48 Finding Max-Margin Matrix Factorizations. The average-norm problem — maximize M subject to Y_ij X_ij ≥ M, X = U·V, (Σ_i ||U_i||²)(Σ_j ||V_j||²) ≤ 1 — can be written via the trace norm ||X||_tr = Σ (singular values of X) as a semi-definite program: minimize tr(A) + tr(B) + c Σ ξ_ij subject to [A, X; X', B] p.s.d. and Y_ij X_ij ≥ 1 − ξ_ij for observed (i,j). The dual has a variable Q_ij for each observed (i,j): maximize Σ Q_ij subject to 0 ≤ Q_ij ≤ c and ||Q ⊗ Y||_2 ≤ 1, where ⊗ is the sparse elementwise product (zero for unobserved entries).

49 Finding Max-Margin Matrix Factorizations: a semi-definite program with a sparse dual, limited by the number of observations rather than the matrix size (for both the average-norm and max-norm formulations). Current implementation: CSDP (an off-the-shelf solver), up to about 30k observations (e.g. 1000×1000 with 3% observed). For large-scale problems: updates on the dual alone? With a dual variable Q_ij for each observed (i,j): primal, minimize tr(A) + tr(B) + c Σ ξ_ij with Y_ij X_ij ≥ 1 − ξ_ij and [A, X; X', B] p.s.d.; dual, maximize Σ Q_ij with 0 ≤ Q_ij ≤ c and ||Q ⊗ Y||_2 ≤ 1 (sparse elementwise product, zero for unobserved entries).
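
For small problems the trace-norm formulation can be prototyped with a generic convex solver. The sketch below (my own; not the CSDP setup used in the thesis) uses cvxpy's nuclear-norm atom, which internally builds the same kind of semidefinite representation, together with a hinge loss on the observed entries.

```python
import cvxpy as cp
import numpy as np

def mmmf_trace_norm(Y, observed, c=1.0):
    """Soft-margin trace-norm matrix completion:
    minimize ||X||_tr + c * sum over observed ij of max(0, 1 - Y_ij X_ij)."""
    X = cp.Variable(Y.shape)
    hinge = cp.pos(1 - cp.multiply(Y, X))                  # hinge on every entry
    objective = cp.normNuc(X) + c * cp.sum(cp.multiply(observed, hinge))
    cp.Problem(cp.Minimize(objective)).solve()
    return X.value

Y = np.array([[1., -1., 1.], [1., -1., 0.], [0., 1., -1.]])  # 0 marks "unobserved"
observed = (Y != 0).astype(float)                            # masks out unobserved hinges
print(np.sign(mmmf_trace_norm(Y, observed)))
```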

50 Loss Functions for Rankings. (Figure: loss(X_ij; Y_ij) for a binary observation Y_ij = +1, as a function of X_ij.)

51 Loss Functions for Rankings. (Figure: loss(X_ij; Y_ij) for an observed rating level Y_ij, with thresholds θ_1 ≤ θ_2 ≤ ... ≤ θ_6 on the X_ij axis.)

52 Loss Functions for Rankings: immediate-threshold loss [Shashua Levin 03] vs. all-threshold loss, shown for Y_ij = 3 with thresholds θ_1, ..., θ_6 on the X_ij axis. The all-threshold loss is a bound on the absolute rank difference. For both loss functions, the thresholds θ are learned per user.
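
A sketch of the two losses for a single entry, with hinge penalties at the per-user thresholds (the hinge choice and the names are my assumptions; the level y ranges over 1..len(theta)+1).

```python
import numpy as np

def immediate_threshold_loss(x, y, theta):
    """Hinge penalties only at the two thresholds that bracket the observed level y."""
    loss = 0.0
    if y > 1:
        loss += max(0.0, 1 - (x - theta[y - 2]))   # x should lie above theta_{y-1}
    if y <= len(theta):
        loss += max(0.0, 1 + (x - theta[y - 1]))   # x should lie below theta_y
    return loss

def all_threshold_loss(x, y, theta):
    """Hinge penalties at every threshold; upper-bounds the absolute rank difference."""
    loss = 0.0
    for r, t in enumerate(theta, start=1):         # threshold r separates levels r, r+1
        if y <= r:
            loss += max(0.0, 1 + (x - t))          # x should lie below threshold r
        else:
            loss += max(0.0, 1 - (x - t))          # x should lie above threshold r
    return loss

theta = np.array([-1.5, -0.5, 0.5, 1.5])           # 4 thresholds -> 5 rating levels
print(immediate_threshold_loss(2.3, 1, theta), all_threshold_loss(2.3, 1, theta))
```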

53 Experimental Results on a MovieLens Subset (100 movies, 3515 training ratings, 3515 test ratings): all-threshold MMMF, immediate-threshold MMMF, K-medians (K=2), Rank-1, and Rank-2 compared on mean absolute rank difference and zero-one error.

54 Outline. Finding Low Rank Approximations: Weighted Low Rank Approx (minimize Σ_ij W_ij (Y_ij − X_ij)²); use WLRA as a basis for other losses / conditional models. Consistency of Low Rank Approximation: when more data is available, do we converge to the correct solution? Not always. Matrix Factorization for Collaborative Filtering: Maximum Margin Matrix Factorization; Generalization Error Bounds (Low Rank & MMMF).

55 Generalization Error Bounds. Generalization error: D(X; Y) = Σ_ij loss(X_ij; Y_ij)/nm, with Y unknown and assumption-free. Prior work assumes a low-rank structure (eigengap): asymptotic behavior [Azar+01]; sample complexity and query strategy [Drineas+02].

56 Generalization Error Bounds. Generalization error D(X; Y) = Σ_ij loss(X_ij; Y_ij)/nm; empirical error D_S(X; Y) = Σ_{ij∈S} loss(X_ij; Y_ij)/|S|, where the training set S is drawn from a source distribution over entries and Y is unknown and assumption-free. Goal: Pr_S( for every rank-k hypothesis X: D(X; Y) < D_S(X; Y) + ε ) > 1 − δ.

57 Generalization Error Bounds, 0/1 loss: loss(X_ij; Y_ij) = 1 when sign(X_ij) ≠ Y_ij. D(X; Y) = Σ_ij loss(X_ij; Y_ij)/nm (generalization error), D_S(X; Y) = Σ_{ij∈S} loss(X_ij; Y_ij)/|S| (empirical error). For a particular X, Y and a random S, loss(X_ij; Y_ij) ~ Bernoulli(D(X; Y)), so Pr( D_S(X; Y) < D(X; Y) − ε ) < e^{−2|S|ε²}. A union bound over all possible sign patterns of X then gives Pr_S( for all X: D(X; Y) < D_S(X; Y) + ε ) > 1 − δ with ε = √( (log(# of possible Xs) + log(1/δ)) / (2|S|) ); counting the possible sign patterns of rank-k matrices is the next step.

58 Number of Sign Configurations of Rank-k Matrices. X = U·V with nk + mk variables; each entry X_ij = Σ_r U_ir V_rj is one of nm polynomials of degree 2. Warren (1968): the number of connected components of { x : ∀i, P_i(x) ≠ 0 }, and hence the number of sign configurations, is at most (4e · degree · #polys / #variables)^(#variables). Based on [Alon95].

59 Number of Sign Configurations of Rank-k Matrices. X = U·V with nk + mk variables; each entry X_ij = Σ_r U_ir V_rj is one of nm polynomials of degree 2. Plugging into Warren's bound: the number of sign configurations is at most (4e · 2 · nm / (k(n+m)))^(k(n+m)). Based on [Alon95].

60 Generalization Error Bounds: Low Rank Matrix Factorization. D(X; Y) = Σ_ij loss(X_ij; Y_ij)/nm (generalization error), D_S(X; Y) = Σ_{ij∈S} loss(X_ij; Y_ij)/|S| (empirical error). Pr_S( for all X of rank k: D(X; Y) < D_S(X; Y) + ε ) > 1 − δ. For the 0/1 loss (loss(X_ij; Y_ij) = 1 when sign(X_ij) ≠ Y_ij): ε = √( (k(n+m) log(8em/k) + log(1/δ)) / (2|S|) ). For a general loss bounded by 1 (by bounding the pseudodimension): ε = √( 6 (k(n+m) log(8em/k) log(|S|/(k(n+m))) + log(1/δ)) / |S| ).
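
To give a feel for the scale of the 0/1-loss bound, a small numeric sketch plugging in example sizes (the formula is my reconstruction of the slide above).

```python
import numpy as np

def rank_bound_eps(n, m, k, S, delta=0.05):
    """epsilon = sqrt( (k (n+m) log(8 e m / k) + log(1/delta)) / (2 |S|) )."""
    return np.sqrt((k * (n + m) * np.log(8 * np.e * m / k) + np.log(1 / delta)) / (2 * S))

# e.g. a 1000 x 1000 matrix, rank 2, 30,000 observed entries (3%)
print(round(rank_bound_eps(1000, 1000, 2, 30000), 3))
```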

61 Generalization Error Bounds: Large Margin Matrix Factorization. D(X; Y) = Σ_ij loss(X_ij; Y_ij)/nm with the zero-one loss (1 when sign(X_ij) ≠ Y_ij); D_S(X; Y) = Σ_{ij∈S} loss_1(X_ij; Y_ij)/|S| with the margin loss loss_1(X_ij; Y_ij) = 1 when Y_ij X_ij < 1. Pr_S( for all X: D(X; Y) < D_S(X; Y) + ε ) > 1 − δ. For (Σ_i ||U_i||²/n)(Σ_j ||V_j||²/m) ≤ R²: ε = K ⁴√(ln m) √( (R²(n+m) log n + log(1/δ)) / |S| ), where K is a universal constant from the bound on the spectral norm of a random matrix in [Seginer00]. For (max_i ||U_i||²)(max_j ||V_j||²) ≤ R²: ε = √( (12 R²(n+m) + log(1/δ)) / |S| ).

62 Maximum Margin Matrix Factorization as a Convex Combination of Classifiers. { U·V : (Σ_i ||U_i||²)(Σ_j ||V_j||²) ≤ 1 } = convex-hull( { u v' : u ∈ R^n, v ∈ R^m, ||u|| = ||v|| = 1 } ). conv( { u v' : u ∈ {±1}^n, v ∈ {±1}^m } ) ⊆ { U·V : (max_i ||U_i||²)(max_j ||V_j||²) ≤ 1 } ⊆ 2 · conv( { u v' : u ∈ {±1}^n, v ∈ {±1}^m } ), by Grothendieck's inequality.

63 Summary. Finding Low Rank Approximations: Weighted Low Rank Approximations; a basis for other loss functions via Newton steps and Gaussian mixtures. Consistency of Low Rank Approximation: ML for popular low-rank models is not consistent! PCA is consistent for additive noise; ignoring the diagonal works for unbiased noise; efficient estimators? consistent estimators for Exponential PCA? Maximum Margin Matrix Factorization: correspondence with large-margin linear classification; sparse SDPs for both average-norm and max-norm formulations; direct optimization of the dual would enable large-scale applications. Generalization Error Bounds for Collaborative Prediction: first assumption-free bounds for matrix completion, both for Low-Rank and for Max-Margin; observation process?

64 Average February Temperature (centigrade): Tel-Aviv, Haifa (Technion), Boston, Toronto ????
