Probabilistic & Unsupervised Learning
1 Probabilistic & Unsupervised Learning. Week 2: Latent Variable Models. Maneesh Sahani, Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London. Term 1, Autumn 2009.
2 The story so far... Unsupervised Learning: beliefs about data represented by a parameterised distribution P(D | θ, m) = ∏_{i=1}^n P(x_i | θ, m); learning involves estimating a value (or distribution) for θ. Bayes: P(θ | D, m) = P(D | θ, m) P(θ | m) / P(D | m). MAP: θ* = argmax_θ P(θ | D, m). ML: θ* = argmax_θ P(D | θ, m). And (possibly) model selection based on P(m | D), which requires the marginal likelihood ∫ dθ P(D | θ, m) P(θ | m).
3 Aren't We Done? No! Direct application of Bayes' rule is intractable for all but the simplest models. Even maximum-likelihood (or similar) learning may be prohibitive. Algorithms (and approximations) are dictated by the form of the distribution.
4 What about these data? [Scatter plot of x_i2 against x_i1.] The joint distribution p(x_i1, x_i2) is not Normal. The conditional distributions p(x_i1 | x_i2) and p(x_i2 | x_i1) vary, and are difficult to codify. The data are easier to describe by a latent, tri-valued discrete process: s_i ~ Discrete[1/3, 1/3, 1/3], x_i | s_i ~ N(µ_{s_i}, Σ).
5 Latent Variable Models. Explain correlations in x by assuming some latent variables y. [Hierarchy: objects, illumination, pose → object parts, surfaces → edges → retinal image, i.e. pixels.] Generative model: y ~ P[θ_y], x | y ~ P[θ_x], so that p(x, y; θ_x, θ_y) = p(x | y; θ_x) p(y; θ_y) and p(x; θ_x, θ_y) = ∫ dy p(x | y; θ_x) p(y; θ_y).
6 Latent variables and Gaussians. [3D scatter plot with axes x_1, x_2, x_3.] The data were measured in 3D, but are mostly specified by position on a 2D surface.
7 Principal Components Analysis. Data such as these are often modelled using Principal Components Analysis (PCA). Assume the data D = {x_i} have zero mean (if not, subtract it!). Find the direction of greatest variance, λ_(1): λ_(1) = argmax_{‖v‖=1} ∑_n (x_n^T v)^2. Find the direction orthogonal to λ_(1) with greatest variance, λ_(2). ... Find the direction orthogonal to {λ_(1), λ_(2), ..., λ_(n−1)} with greatest variance, λ_(n). Terminate when the remaining variance drops below a threshold.
8 Eigendecomposition of a Covariance Matrix. The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix C if Cu = ωu. u can have any norm, but we will define it to be unity (i.e., u^T u = 1). For a covariance matrix C = ⟨xx^T⟩ (which is D×D, symmetric, positive semi-definite): in general there are D eigenvector–eigenvalue pairs (u_(i), ω_(i)), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate: any linear combination is also an eigenvector). The D eigenvectors are orthogonal (or orthogonalisable, if ω_(i) = ω_(j)). Thus, they form an orthonormal basis: ∑_i u_(i) u_(i)^T = I. Any vector v can be written as v = (∑_i u_(i) u_(i)^T) v = ∑_i (u_(i)^T v) u_(i) = ∑_i v_(i) u_(i). The original matrix C can be written C = ∑_i ω_(i) u_(i) u_(i)^T = U S U^T, where U = [u_(1), u_(2), ..., u_(D)] collects the eigenvectors and S = diag(ω_(1), ω_(2), ..., ω_(D)).
9 PCA and Eigenvectors. The variance in direction u_(i) is ⟨(x^T u_(i))^2⟩ = u_(i)^T ⟨xx^T⟩ u_(i) = u_(i)^T C u_(i) = u_(i)^T ω_(i) u_(i) = ω_(i). The variance in an arbitrary direction v is ⟨(x^T v)^2⟩ = ⟨(x^T ∑_i v_(i) u_(i))^2⟩ = ∑_{ij} v_(i) u_(i)^T C u_(j) v_(j) = ∑_{ij} v_(i) ω_(j) v_(j) u_(i)^T u_(j) = ∑_i v_(i)^2 ω_(i). If v^T v = 1, then ∑_i v_(i)^2 = 1, and so argmax_{‖v‖=1} ⟨(x^T v)^2⟩ = u_(max). The direction of greatest variance is the eigenvector with the largest eigenvalue. In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue. The eigenspectrum shows how the variance is distributed across dimensions. [Plots: eigenvalue (variance) against eigenvalue number; fractional variance remaining against eigenvalue number.]
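As a concrete (non-lecture) illustration, here is a minimal numpy sketch of PCA via eigendecomposition of the empirical covariance. The data matrix X, the helper name pca_eig, and the toy example are assumptions for illustration, not part of the notes.

```python
import numpy as np

def pca_eig(X, K):
    """PCA of zero-mean data X (N x D): eigendecompose the empirical covariance
    and return the top-K eigenvectors plus the full eigenspectrum."""
    C = X.T @ X / X.shape[0]                 # empirical covariance, D x D
    evals, evecs = np.linalg.eigh(C)         # symmetric eigendecomposition (ascending order)
    order = np.argsort(evals)[::-1]          # reorder by decreasing eigenvalue
    return evecs[:, order[:K]], evals[order]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 3))   # data near a 2D subspace in 3D
X += 0.05 * rng.standard_normal(X.shape)                          # plus a little noise
X -= X.mean(axis=0)                                               # subtract the mean first!
U, spectrum = pca_eig(X, K=2)
print(spectrum / spectrum.sum())             # fractional variance along each eigenvector
```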
10 PCA subspace. The K principal components define the K-dimensional subspace of greatest variance. [3D scatter plot with the principal subspace; axes x_1, x_2, x_3.] Each data point x_n is associated with a projection x̂_n into the principal subspace: x̂_n = ∑_{k=1}^K x_{n(k)} λ_(k). This can be used for lossy compression, denoising, recognition, ...
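A hedged sketch of this projection and reconstruction step, assuming zero-mean data and the same eigendecomposition route as above; the function and variable names are illustrative only.

```python
import numpy as np

def pca_project(X, K):
    """Project zero-mean data X (N x D) onto the K-dimensional principal
    subspace and reconstruct: a lossy-compression sketch."""
    C = X.T @ X / X.shape[0]
    evals, evecs = np.linalg.eigh(C)
    U = evecs[:, np.argsort(evals)[::-1][:K]]   # top-K principal directions (D x K)
    Z = X @ U                                   # coordinates x_{n(k)} in the subspace (N x K)
    X_hat = Z @ U.T                             # reconstruction x_hat_n = sum_k x_{n(k)} lambda_(k)
    return Z, X_hat

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 10))
X -= X.mean(axis=0)
Z, X_hat = pca_project(X, K=5)
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```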
11 Example of PCA: Eigenfaces from www-white.media.mit.edu/vismod/demos/facerec/basic.html
12 Another view of PCA: Mutual Information. Problem: given x, find y = Ax, with the columns of A unit vectors, such that I(y; x) is maximised (assuming that P(x) is Gaussian). I(y; x) = H(y) + H(x) − H(y, x) = H(y). So we want to maximise the entropy of y. What is the entropy of a Gaussian? H(z) = −∫ dz p(z) ln p(z) = ½ ln |Σ| + (D/2)(1 + ln 2π). Therefore we want the distribution of y to have the largest volume (i.e. determinant of its covariance matrix): Σ_y = A Σ_x A^T = A U S_x U^T A^T. So A should be aligned with the columns of U associated with the largest eigenvalues (variances). Projection onto the principal component subspace preserves the most information about the (Gaussian) data.
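A small numerical check (not from the slides) of the Gaussian entropy formula and of the claim that, among unit-vector projections, the top eigenvector of Σ_x gives the largest entropy; the toy covariance and names are assumptions.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(0, cov): 0.5*ln|cov| + (D/2)*(1 + ln 2*pi)."""
    D = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet + 0.5 * D * (1.0 + np.log(2.0 * np.pi))

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
Sigma_x = A @ A.T                                # covariance of the Gaussian data x
evals, evecs = np.linalg.eigh(Sigma_x)

# 1-D projections y = a^T x: entropy is 0.5*ln(a^T Sigma_x a) + const,
# so among unit vectors it is maximised by the top eigenvector of Sigma_x.
top = evecs[:, -1]
rand = rng.standard_normal(3); rand /= np.linalg.norm(rand)
for a in (top, rand):                            # prints the larger value (top) first
    print(gaussian_entropy(np.array([[a @ Sigma_x @ a]])))
```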
13 Network Interpretations and Encoder–Decoder Duality. [Network diagram: input units x_1, x_2, ..., x_D → hidden units y_1, ..., y_K → output units x̂_1, x̂_2, ..., x̂_D; the input-to-hidden mapping is the encoder (recognition) and the hidden-to-output mapping is the decoder (generation).]
14 From Supervised Learning to PCA. [Same encoder–decoder network diagram as the previous slide.] A linear autoencoder neural network trained to minimise squared error learns to perform PCA (Baldi & Hornik, 1989).
15 Probabilistic PCA. We could think of the projection x̂_n as a latent variable, but it is more usual to construct a standard normal latent y. [Graphical model: latents y_1, ..., y_K with arrows to observations x_1, x_2, ..., x_D.] Observed vector data D = {x_1, x_2, ..., x_N}, x_i ∈ R^D; assumed latent variables {y_1, y_2, ..., y_N}, y_i ∈ R^K, with K < D. Linear generative model: x_d = ∑_{k=1}^K Λ_dk y_k + ɛ_d, where the y_k are independent N(0, 1) Gaussian factors and the ɛ_d are independent N(0, ψ) Gaussian noise. So the model for the observations x is still Gaussian, with p(y) = N(0, I), p(x | y) = N(Λy, ψI), and p(x) = ∫ p(y) p(x | y) dy = N(0, ΛΛ^T + ψI), where Λ is a D×K matrix. This is Probabilistic PCA (pPCA). The maximum-likelihood parameter Λ, for fixed ψ, lies in the K-principal subspace.
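A rough sketch, under the model just stated, of sampling from the pPCA generative process and checking that the empirical marginal covariance of x matches ΛΛ^T + ψI; the dimensions and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
D, K, N = 5, 2, 100000
Lam = rng.standard_normal((D, K))          # factor loading matrix Lambda (D x K)
psi = 0.1                                  # isotropic noise variance

# Generate from the pPCA model: y ~ N(0, I_K), x | y ~ N(Lambda y, psi * I_D)
Y = rng.standard_normal((N, K))
X = Y @ Lam.T + np.sqrt(psi) * rng.standard_normal((N, D))

# The marginal of x should be N(0, Lambda Lambda^T + psi I)
empirical_cov = X.T @ X / N
model_cov = Lam @ Lam.T + psi * np.eye(D)
print(np.max(np.abs(empirical_cov - model_cov)))   # small for large N
```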
16 Probabilistic PCA. [Illustrative figures: samples in the latent space (y_1, y_2) and the corresponding points in the data space (x_1, x_2) under the pPCA generative model.]
17 PCA and pPCA. In PCA the noise is orthogonal to the subspace, and we can project x_n → x̂_n trivially. In pPCA the noise is more sensible (equal in all directions). But what is the projection? Find the expected value of y_n | x_n and then take x̂_n = Λ⟨y_n⟩. Tactic: write p(y_n, x_n | θ) and consider x_n to be fixed. What is this as a function of y_n?
p(y_n, x_n) = p(y_n) p(x_n | y_n)
= (2π)^{−K/2} exp{−½ y_n^T y_n} |2πΨ|^{−1/2} exp{−½ (x_n − Λy_n)^T Ψ^{−1} (x_n − Λy_n)}
= c exp{−½ [y_n^T y_n + (x_n − Λy_n)^T Ψ^{−1} (x_n − Λy_n)]}
= c exp{−½ [y_n^T (I + Λ^T Ψ^{−1} Λ) y_n − 2 y_n^T Λ^T Ψ^{−1} x_n]}
= c exp{−½ [y_n^T Σ^{−1} y_n − 2 y_n^T Σ^{−1} µ + µ^T Σ^{−1} µ]}.
So Σ = (I + Λ^T Ψ^{−1} Λ)^{−1} = I − βΛ and µ = ΣΛ^T Ψ^{−1} x_n = βx_n, where β = ΣΛ^T Ψ^{−1}. Thus
x̂_n = Λ(I + Λ^T Ψ^{−1} Λ)^{−1} Λ^T Ψ^{−1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{−1} x_n.
This is not the same projection: pPCA takes into account noise in the principal subspace. As ψ → 0, pPCA → PCA.
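A sketch (with made-up dimensions) verifying numerically that the two expressions for the pPCA projection x̂_n agree; the names Lam, Psi, and the random test point are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 5, 2
Lam = rng.standard_normal((D, K))
Psi = 0.1 * np.eye(D)                      # pPCA: spherical noise psi * I
x = rng.standard_normal(D)

# Posterior over y given x: Sigma = (I + Lambda^T Psi^-1 Lambda)^-1, mu = Sigma Lambda^T Psi^-1 x
Psi_inv = np.linalg.inv(Psi)
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)
mu = Sigma @ Lam.T @ Psi_inv @ x

# Two equivalent forms of the projection x_hat = Lambda E[y | x]
x_hat1 = Lam @ mu
x_hat2 = x - Psi @ np.linalg.solve(Lam @ Lam.T + Psi, x)
print(np.allclose(x_hat1, x_hat2))         # True: same projection
```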
18 Factor Analysis. If the dimensions are not equivalent, equal noise variance makes no sense. [Graphical model: latents y_1, ..., y_K with arrows to observations x_1, x_2, ..., x_D.] Observed vector data D = {x_1, x_2, ..., x_N}, x_i ∈ R^D; assumed latent variables {y_1, y_2, ..., y_N}, y_i ∈ R^K, with K < D. Linear generative model: x_d = ∑_{k=1}^K Λ_dk y_k + ɛ_d, where the y_k are independent N(0, 1) Gaussian factors and the ɛ_d are independent N(0, Ψ_dd) Gaussian noise. So the model for the observations x is still Gaussian, with p(y) = N(0, I), p(x | y) = N(Λy, Ψ), and p(x) = ∫ p(y) p(x | y) dy = N(0, ΛΛ^T + Ψ), where Λ is a D×K matrix and Ψ is D×D and diagonal. Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
19 Factor Analysis (cont.). ML learning finds Λ and Ψ given data. Number of parameters (corrected for symmetries): DK + D − K(K−1)/2. If the number of parameters exceeds D(D+1)/2 (the number of free entries in the data covariance), the model is not identifiable. There is no closed-form solution for the ML parameters of N(0, ΛΛ^T + Ψ). [A Bayesian treatment would also place priors over Λ and Ψ and would average over them for prediction.]
20 Factor Analysis Projections. Our analysis for pPCA still applies: x̂_n = Λ(I + Λ^T Ψ^{−1} Λ)^{−1} Λ^T Ψ^{−1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{−1} x_n, but now Ψ is diagonal rather than spherical. Note, though, that the ML Λ is generally different from that found by pPCA.
21 Gradient Methods of Learning FA. Write down the negative log likelihood (for a zero-mean data point x): ½ log |2π(ΛΛ^T + Ψ)| + ½ x^T (ΛΛ^T + Ψ)^{−1} x. Optimise w.r.t. Λ and Ψ (needs matrix calculus), subject to constraints. We will soon see an easier way to learn latent variable models...
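A minimal numpy sketch of this negative log likelihood for factor analysis, assuming zero-mean data; it only evaluates the objective and does not implement the gradient-based optimisation itself.

```python
import numpy as np

def fa_neg_log_lik(Lam, psi_diag, X):
    """Average negative log likelihood of zero-mean data X (N x D) under the
    factor analysis marginal x ~ N(0, Lambda Lambda^T + Psi), Psi = diag(psi_diag)."""
    C = Lam @ Lam.T + np.diag(psi_diag)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * C)
    quad = np.einsum('nd,dn->n', X, np.linalg.solve(C, X.T))   # x^T C^-1 x per data point
    return 0.5 * logdet + 0.5 * quad.mean()

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 4))
X -= X.mean(axis=0)
print(fa_neg_log_lik(rng.standard_normal((4, 2)), np.ones(4), X))
```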
22 FA vs PCA. PCA and pPCA are rotationally invariant; FA is not: if x → Ux for unitary U, then λ^PCA_(i) → Uλ^PCA_(i). FA is measurement-scale invariant; PCA and pPCA are not: if x → Sx for diagonal S, then λ^FA_(i) → Sλ^FA_(i). FA and pPCA define a probabilistic model; PCA does not.
23 Limitations of Gaussian, FA and PCA models. Gaussian, FA and PCA models are easy to understand and use in practice. They are a convenient way to find interesting directions in very high dimensional data sets, e.g. as preprocessing. Their problem is that they make very strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive. [Scatter plot of x_i2 against x_i1, as before.] By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.
24 Mixture Distributions. [Scatter plot of x_i2 against x_i1, as before.] A mixture distribution has a single discrete latent variable: s_i ~ iid Discrete[π], x_i | s_i ~ P_{s_i}[θ_{s_i}]. Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.
25 The Mixture Likelihood. The mixture model is s_i ~ iid Discrete[π], x_i | s_i ~ P_{s_i}[θ_{s_i}]. Under the discrete distribution, P(s_i = m) = π_m, with π_m ≥ 0 and ∑_{m=1}^k π_m = 1. Thus, the probability (density) at a single data point x_i is P(x_i) = ∑_{m=1}^k P(x_i | s_i = m) P(s_i = m) = ∑_{m=1}^k π_m P_m(x_i; θ_m). The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).
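For concreteness, a small sketch (not from the notes) evaluating a one-dimensional mixture-of-Gaussians density as a convex combination of component densities; the particular weights, means, and variances are arbitrary.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_pdf(x, pis, mus, vars_):
    """Convex combination of components: sum_m pi_m N(x; mu_m, var_m)."""
    return sum(p * gauss_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_))

x = np.linspace(-4, 4, 9)
pis, mus, vars_ = [1/3, 1/3, 1/3], [-2.0, 0.0, 2.0], [0.5, 0.5, 0.5]
print(mixture_pdf(x, pis, mus, vars_))         # density of the 3-component mixture at x
```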
26 Approximation with a Mixture of Gaussians (MoG). The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions. Here are examples where non-Gaussian densities are modelled (approximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density. [Examples: Uniform, Triangle, Heavy tails.] Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.
27 Clustering with a MoG
28 Clustering with a MoG. In clustering applications, the latent variable s_i represents the (unknown) identity of the cluster to which the ith observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster: P(s_i = m | π) = π_m. Data from the mth cluster are distributed according to the mth component: P(x_i | s_i = m) = P_m(x_i). Once we observe a data point, the posterior probability distribution for the cluster it belongs to is P(s_i = m | x_i) = P_m(x_i) π_m / ∑_{m'} P_{m'}(x_i) π_{m'}. This is often called the responsibility of the mth cluster for the ith data point.
29 The MoG likelihood. Each component of a MoG is a Gaussian, with mean µ_m and covariance matrix Σ_m. Thus, the probability density evaluated at a set of n iid observations, D = {x_1 ... x_n} (i.e. the likelihood), is
p(D | {µ_m}, {Σ_m}, π) = ∏_{i=1}^n ∑_{m=1}^k π_m N(x_i | µ_m, Σ_m) = ∏_{i=1}^n ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp[−½ (x_i − µ_m)^T Σ_m^{−1} (x_i − µ_m)].
The log of the likelihood is
log p(D | {µ_m}, {Σ_m}, π) = ∑_{i=1}^n log ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp[−½ (x_i − µ_m)^T Σ_m^{−1} (x_i − µ_m)].
Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
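A hedged sketch of evaluating this log likelihood stably via a log-sum-exp over components, using scipy's Gaussian log-density; the toy parameters and data are assumptions, not part of the lecture.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mog_log_lik(X, pis, mus, Sigmas):
    """log p(D) = sum_i log sum_m pi_m N(x_i | mu_m, Sigma_m), computed stably."""
    log_p = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mu, Sig)
                      for pi, mu, Sig in zip(pis, mus, Sigmas)], axis=1)   # (N, k)
    return logsumexp(log_p, axis=1).sum()

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(mog_log_lik(X, pis, mus, Sigmas))
```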
30 Maximum Likelihood for a Mixture Model. The log likelihood is L = ∑_{i=1}^n log ∑_{m=1}^k π_m P_m(x_i; θ_m). Its partial derivative wrt θ_m is
∂L/∂θ_m = ∑_{i=1}^n [π_m / ∑_{m'} π_{m'} P_{m'}(x_i; θ_{m'})] ∂P_m(x_i; θ_m)/∂θ_m
or, using ∂P/∂θ = P ∂ log P/∂θ,
= ∑_{i=1}^n [π_m P_m(x_i; θ_m) / ∑_{m'} π_{m'} P_{m'}(x_i; θ_{m'})] ∂ log P_m(x_i; θ_m)/∂θ_m = ∑_{i=1}^n r_im ∂ log P_m(x_i; θ_m)/∂θ_m,
where the bracketed term is the responsibility r_im. Its partial derivative wrt π_m is
∂L/∂π_m = ∑_{i=1}^n P_m(x_i; θ_m) / ∑_{m'} π_{m'} P_{m'}(x_i; θ_{m'}) = ∑_{i=1}^n r_im / π_m.
31 MoG Derivatives. For a MoG, with θ_m = {µ_m, Σ_m}, we get
∂L/∂µ_m = ∑_{i=1}^n r_im Σ_m^{−1} (x_i − µ_m)
∂L/∂Σ_m^{−1} = ½ ∑_{i=1}^n r_im (Σ_m − (x_i − µ_m)(x_i − µ_m)^T).
These equations can be used (along with the derivatives wrt π_m) for gradient-based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).
32 The K-means Algorithm. The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis). Take π_m = 1/k and Σ_m = σ²I, with σ² → 0. Then the responsibilities become binary, r_im → δ(m, argmin_l ‖x_i − µ_l‖²), with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0. The k-means algorithm iterates these two steps: assign each point to its closest mean (set r_im = δ(m, argmin_l ‖x_i − µ_l‖²)); update the means to the average of their assigned points (set µ_m = ∑_i r_im x_i / ∑_i r_im). This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µ_m. The algorithm has no learning rate, but the assumptions are quite limiting.
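A minimal sketch of the two k-means steps just described; the initialisation from random data points, the fixed iteration count, and the toy data are arbitrary choices, not part of the lecture.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: hard assignments to the closest mean, then mean updates."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]           # initialise at random data points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)    # squared distances (N, k)
        assign = d2.argmin(axis=1)                                # binary r_im as hard labels
        for m in range(k):
            if np.any(assign == m):                               # keep the old mean if a cluster empties
                mus[m] = X[assign == m].mean(axis=0)
    return mus, assign

rng = np.random.default_rng(7)
X = np.vstack([rng.standard_normal((50, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
mus, assign = kmeans(X, k=3)
print(mus)
```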
33 A preview of the EM algorithm. We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µ_m and Σ_m. The EM algorithm for MoGs iterates these two steps: evaluate the responsibilities for each point given the current parameters; optimise the parameters assuming the responsibilities stay fixed, µ_m = ∑_i r_im x_i / ∑_i r_im and Σ_m = ∑_i r_im (x_i − µ_m)(x_i − µ_m)^T / ∑_i r_im. Although this appears ad hoc, we will see (later) that it is a special case of a general algorithm, and is actually guaranteed to increase the likelihood at each iteration.
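A compact sketch of these two EM steps for a MoG (with a small ridge added to the covariances, as a practical guard against the singularities mentioned on the next slide); all names, constants, and toy data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, n_iter=50, seed=0):
    """EM for a mixture of Gaussians: E-step (responsibilities), M-step (pi, mu, Sigma)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(N, size=k, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities r_im proportional to pi_m N(x_i | mu_m, Sigma_m)
        R = np.stack([pi * multivariate_normal.pdf(X, mu, Sig)
                      for pi, mu, Sig in zip(pis, mus, Sigmas)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update parameters with the responsibilities held fixed
        Nm = R.sum(axis=0)
        pis = Nm / N
        mus = (R.T @ X) / Nm[:, None]
        for m in range(k):
            Xc = X - mus[m]
            Sigmas[m] = (R[:, m, None] * Xc).T @ Xc / Nm[m] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas

rng = np.random.default_rng(8)
X = np.vstack([rng.standard_normal((100, 2)) + c for c in ([0, 0], [4, 4])])
pis, mus, Sigmas = em_mog(X, k=2)
print(pis, mus, sep="\n")
```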
34 Issues. There are several problems with the new algorithms: slow convergence for the gradient-based method; the gradient-based method may develop invalid covariance matrices; local minima, so the end configuration may depend on the starting state; how do you adjust k? (using the likelihood alone is no good); and singularities: components responsible for a single data point will have their covariance going to zero and the likelihood will tend to infinity.