Probabilistic & Unsupervised Learning


1 Probabilistic & Unsupervised Learning
Week 2: Latent Variable Models
Maneesh Sahani
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2009

2 The story so far...
Unsupervised learning: beliefs about data are represented by a parameterised distribution
    P(D | θ, m) = ∏_{i=1}^{n} P(x_i | θ, m)
Learning involves estimating a value (or distribution) for θ:
    Bayes:  P(θ | D, m) = P(D | θ, m) P(θ | m) / P(D | m)
    MAP:    θ̂ = argmax_θ P(θ | D, m)
    ML:     θ̂ = argmax_θ P(D | θ, m)
and (possibly) model selection via P(m | D), where P(D | m) = ∫ dθ P(D | θ, m) P(θ | m).

3 Aren't We Done?
No!
- Direct application of Bayes' rule is intractable for all but the simplest models.
- Even maximum-likelihood (or similar) learning may be prohibitive.
- Algorithms (and approximations) are dictated by the form of the distribution.

4 What about these data?
[Scatter plot of x_{i2} against x_{i1}.]
- The joint distribution p(x_{i1}, x_{i2}) is not Normal.
- The conditional distributions p(x_{i1} | x_{i2}) and p(x_{i2} | x_{i1}) vary, and are difficult to codify.
- The data are easier to describe by a latent tri-valued discrete process:
    s_i ~ Discrete[1/3, 1/3, 1/3]
    x_i | s_i ~ N(µ_{s_i}, Σ)

5 Latent Variable Models
Explain correlations in x by assuming some latent variables y.
[Hierarchical graphical model: latent causes y (e.g. objects, illumination, pose), intermediate variables (e.g. object parts, surfaces; edges), and observations x (retinal image, i.e. pixels).]
    y ~ P[θ_y]
    x | y ~ P[θ_x]
    p(x, y; θ_x, θ_y) = p(x | y; θ_x) p(y; θ_y)
    p(x; θ_x, θ_y) = ∫ dy p(x | y; θ_x) p(y; θ_y)

6 Latent variables and Gaussians
[3D scatter plot with axes x_1, x_2, x_3.]
The data were measured in 3D, but are mostly specified by position on a 2D surface.

7 Principal Components Analysis
Data such as these are often modelled using Principal Components Analysis (PCA).
Assume the data D = {x_i} have zero mean (if not, subtract it!).
- Find the direction of greatest variance, λ_(1):
    λ_(1) = argmax_{||v||=1} Σ_n (x_n^T v)^2
- Find the direction orthogonal to λ_(1) with greatest variance, λ_(2).
- ...
- Find the direction orthogonal to {λ_(1), λ_(2), ..., λ_(n-1)} with greatest variance, λ_(n).
- Terminate when the remaining variance drops below a threshold.
[Scatter plot with axes x_1, x_2, x_3 showing the principal directions.]

8 Eigendecomposition of a Covariance Matrix
The eigendecomposition of a covariance matrix makes finding the PCs easy.
Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix C if
    Cu = ωu
u can have any norm, but we will define it to be unity (i.e., u^T u = 1).
For a covariance matrix C = ⟨xx^T⟩ (which is D×D, symmetric, positive semi-definite):
- In general there are D eigenvector-eigenvalue pairs (u_(i), ω_(i)), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate: any linear combination of them is also an eigenvector).
- The D eigenvectors are orthogonal (or orthogonalisable, if ω_(i) = ω_(j)). Thus, they form an orthonormal basis: Σ_i u_(i) u_(i)^T = I.
- Any vector v can be written as
    v = (Σ_i u_(i) u_(i)^T) v = Σ_i (u_(i)^T v) u_(i) = Σ_i v_(i) u_(i)
- The original matrix C can be written
    C = Σ_i ω_(i) u_(i) u_(i)^T = U S U^T
  where U = [u_(1), u_(2), ..., u_(D)] collects the eigenvectors and S = diag(ω_(1), ω_(2), ..., ω_(D)).
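To make the eigendecomposition route concrete, here is a minimal numpy sketch (not from the slides) that recovers the PCs of the empirical covariance. The function name pca_eig and the toy 3D data are illustrative assumptions.

```python
import numpy as np

def pca_eig(X, K):
    """PCA of data X (N x D) via eigendecomposition of the empirical covariance.

    Returns the top-K eigenvectors (as columns, unit norm) and the full
    eigenspectrum sorted by decreasing eigenvalue.
    """
    Xc = X - X.mean(axis=0)            # zero-mean the data, as the slides assume
    C = Xc.T @ Xc / Xc.shape[0]        # empirical covariance C
    omega, U = np.linalg.eigh(C)       # eigh: C is symmetric positive semi-definite
    order = np.argsort(omega)[::-1]    # sort by decreasing eigenvalue
    return U[:, order[:K]], omega[order]

# Toy example: 3D data lying mostly on a 2D subspace (illustrative only)
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 3)) \
    + 0.1 * rng.standard_normal((500, 3))
U2, spectrum = pca_eig(X, K=2)
print(spectrum)   # eigenspectrum: two large eigenvalues, one small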

9 PCA and Eigenvectors
The variance in direction u_(i) is
    ⟨(x^T u_(i))^2⟩ = u_(i)^T ⟨xx^T⟩ u_(i) = u_(i)^T C u_(i) = u_(i)^T ω_(i) u_(i) = ω_(i)
The variance in an arbitrary direction v is
    ⟨(x^T v)^2⟩ = ⟨(x^T Σ_i v_(i) u_(i))^2⟩ = Σ_{ij} v_(i) u_(i)^T C u_(j) v_(j) = Σ_{ij} v_(i) ω_(j) v_(j) u_(i)^T u_(j) = Σ_i v_(i)^2 ω_(i)
If v^T v = 1, then Σ_i v_(i)^2 = 1, and so
    argmax_{||v||=1} ⟨(x^T v)^2⟩ = u_(max)
The direction of greatest variance is the eigenvector with the largest eigenvalue.
In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue.
The eigenspectrum shows how the variance is distributed across dimensions.
[Plots: eigenvalue (variance) vs. eigenvalue number, and fractional variance remaining vs. eigenvalue number.]

10 PCA subspace
The K principal components define the K-dimensional subspace of greatest variance.
[3D scatter plot with axes x_1, x_2, x_3 showing the principal subspace.]
Each data point x_n is associated with a projection x̂_n into the principal subspace:
    x̂_n = Σ_{k=1}^{K} x_{n(k)} λ_(k)
This can be used for lossy compression, denoising, recognition, ...
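A short sketch of the projection onto the K-dimensional principal subspace and the corresponding lossy reconstruction; pca_project is a made-up name, and centring on the sample mean is an assumption.

```python
import numpy as np

def pca_project(X, K):
    """Project data onto the K-dimensional principal subspace.

    Returns the low-dimensional coordinates and the reconstructions x_hat
    (lossy compression of X). Illustrative sketch only.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / X.shape[0]
    omega, U = np.linalg.eigh(C)
    U_K = U[:, np.argsort(omega)[::-1][:K]]   # top-K eigenvectors as columns
    Z = Xc @ U_K                              # coordinates in the principal subspace
    X_hat = Z @ U_K.T + mu                    # reconstruction from K components
    return Z, X_hat
```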

11 Example of PCA: Eigenfaces from www-white.media.mit.edu/vismod/demos/facerec/basic.html

12 Another view of PCA: Mutual Information
Problem: Given x, find y = Ax, with the columns of A unit vectors, such that I(y; x) is maximised (assuming that P(x) is Gaussian).
    I(y; x) = H(y) + H(x) − H(y, x) = H(y)
So we want to maximise the entropy of y. What is the entropy of a Gaussian?
    H(z) = −∫ dz p(z) ln p(z) = (1/2) ln|Σ| + (D/2)(1 + ln 2π)
Therefore we want the distribution of y to have the largest volume (i.e. determinant of its covariance matrix):
    Σ_y = A Σ_x A^T = A U S_x U^T A^T
So A should be aligned with the columns of U which are associated with the largest eigenvalues (variances).
Projection to the principal component subspace preserves the most information about the (Gaussian) data.

13 Network Interpretations and Encoder-Decoder Duality
[Network diagram: input units x_1, ..., x_D feed through an encoder (recognition) to hidden units y_1, ..., y_K, which feed through a decoder (generation) to output units x̂_1, ..., x̂_D.]

14 From Supervised Learning to PCA
[Same encoder-decoder network diagram as above.]
A linear autoencoder neural network trained to minimise squared error learns to perform PCA (Baldi & Hornik, 1989).

15 Probabilistic PCA
We could think of the projection x̂_n as a latent variable, but it is more usual to construct a standard normal latent y.
[Graphical model: latents y_1, ..., y_K with arrows to observations x_1, ..., x_D.]
Observed vector data D = {x_1, x_2, ..., x_N}; x_i ∈ R^D.
Assumed latent variables {y_1, y_2, ..., y_N}; y_i ∈ R^K, with K < D.
Linear generative model:
    x_d = Σ_{k=1}^{K} Λ_dk y_k + ε_d
where Λ is a D×K matrix,
- the y_k are independent N(0, 1) Gaussian factors,
- the ε_d are independent N(0, ψ) Gaussian noise.
So the model for the observations x is still Gaussian, with
    p(y) = N(0, I)
    p(x | y) = N(Λy, ψI)
    p(x) = ∫ p(y) p(x | y) dy = N(0, ΛΛ^T + ψI)
This is Probabilistic PCA (pPCA). The maximum-likelihood parameter Λ, for fixed ψ, lies in the K-principal subspace.
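A minimal sketch of the pPCA generative process and its Gaussian marginal, assuming arbitrary illustrative values for Λ, ψ, D and K (none of these come from the slides).

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D, K, N = 5, 2, 1000
Lambda = rng.standard_normal((D, K))   # D x K loading matrix (illustrative)
psi = 0.1                              # isotropic noise variance (illustrative)

Y = rng.standard_normal((N, K))                                  # y ~ N(0, I)
X = Y @ Lambda.T + np.sqrt(psi) * rng.standard_normal((N, D))    # x = Λ y + ε

# Marginal: p(x) = N(0, Λ Λ^T + ψ I)
cov_x = Lambda @ Lambda.T + psi * np.eye(D)
print(multivariate_normal(mean=np.zeros(D), cov=cov_x).logpdf(X[:3]))
```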

16 Probabilistic PCA
[Figure: latent-space (y_1, y_2) and data-space (x_1, x_2) plots illustrating the pPCA generative mapping.]

17 PCA and pPCA
In PCA the noise is orthogonal to the subspace, and we can project x_n → x̂_n trivially. In pPCA, the noise is more sensible (equal in all directions). But what is the projection?
Find the expected value of y_n | x_n and then take x̂_n = Λ⟨y_n⟩.
Tactic: write p(y_n, x_n | θ), consider x_n to be fixed. What is this as a function of y_n?
    p(y_n, x_n) = p(y_n) p(x_n | y_n)
                = (2π)^{-K/2} exp{−(1/2) y_n^T y_n} |2πΨ|^{-1/2} exp{−(1/2) (x_n − Λy_n)^T Ψ^{-1} (x_n − Λy_n)}
                = c exp{−(1/2) [y_n^T y_n + (x_n − Λy_n)^T Ψ^{-1} (x_n − Λy_n)]}
                = c exp{−(1/2) [y_n^T (I + Λ^T Ψ^{-1} Λ) y_n − 2 y_n^T Λ^T Ψ^{-1} x_n]}
                = c exp{−(1/2) [y_n^T Σ^{-1} y_n − 2 y_n^T Σ^{-1} µ + µ^T Σ^{-1} µ]}
So Σ = (I + Λ^T Ψ^{-1} Λ)^{-1} = I − βΛ and µ = ΣΛ^T Ψ^{-1} x_n = βx_n, where β = ΣΛ^T Ψ^{-1}. Thus,
    x̂_n = Λ (I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{-1} x_n
This is not the same projection. pPCA takes into account noise in the principal subspace. As ψ → 0, pPCA → PCA.
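A sketch of the posterior-mean projection x̂_n = Λ E[y_n | x_n] for the isotropic-noise case Ψ = ψI; ppca_project is a made-up name and zero-mean data is assumed.

```python
import numpy as np

def ppca_project(X, Lambda, psi):
    """Posterior-mean projection under pPCA: x_hat = Λ E[y | x].

    Assumes zero-mean data X (N x D) and isotropic noise variance psi.
    """
    D, K = Lambda.shape
    Sigma = np.linalg.inv(np.eye(K) + Lambda.T @ Lambda / psi)  # posterior cov of y | x
    beta = Sigma @ Lambda.T / psi                               # E[y | x] = beta x
    Y_mean = X @ beta.T                                         # (N, K) posterior means
    return Y_mean @ Lambda.T                                    # x_hat = Λ E[y | x]
```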

18 Factor Analysis
If the dimensions are not equivalent, equal noise variance makes no sense.
[Graphical model: latents y_1, ..., y_K with arrows to observations x_1, ..., x_D.]
Observed vector data D = {x_1, x_2, ..., x_N}; x_i ∈ R^D.
Assumed latent variables {y_1, y_2, ..., y_N}; y_i ∈ R^K, with K < D.
Linear generative model:
    x_d = Σ_{k=1}^{K} Λ_dk y_k + ε_d
where Λ is a D×K matrix and Ψ is D×D and diagonal,
- the y_k are independent N(0, 1) Gaussian factors,
- the ε_d are independent N(0, Ψ_dd) Gaussian noise.
So the model for the observations x is still Gaussian, with
    p(y) = N(0, I)
    p(x | y) = N(Λy, Ψ)
    p(x) = ∫ p(y) p(x | y) dy = N(0, ΛΛ^T + Ψ)
Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.

19 Factor Analysis (cont.)
[Graphical model: latents y_1, ..., y_K with arrows to observations x_1, ..., x_D.]
ML learning finds Λ and Ψ given data.
Number of parameters (corrected for symmetries):
    DK + D − K(K−1)/2
If the number of parameters is greater than D(D+1)/2, the model is not identifiable.
There is no closed-form solution for the ML parameters of N(0, ΛΛ^T + Ψ).
[A Bayesian treatment would also have priors over Λ and Ψ and would average over them for prediction.]

20 Factor Analysis Projections
Our analysis for pPCA still applies:
    x̂_n = Λ (I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{-1} x_n
but now Ψ is diagonal rather than spherical. Note, though, that Λ is generally different from that found by pPCA.

21 Gradient Methods of Learning FA
Write down the negative log likelihood:
    (1/2) log|2π(ΛΛ^T + Ψ)| + (1/2) x^T (ΛΛ^T + Ψ)^{-1} x
Optimise w.r.t. Λ and Ψ (needs matrix calculus), subject to constraints.
We will soon see an easier way to learn latent variable models...
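A minimal sketch of this objective as code, suitable as the target of a generic gradient-based optimiser; the function name and the averaging over data points are my own choices, not the slides'.

```python
import numpy as np

def fa_neg_log_lik(X, Lambda, psi_diag):
    """Average negative log likelihood of zero-mean data X (N x D) under the
    FA model x ~ N(0, Λ Λ^T + Ψ), with Ψ = diag(psi_diag).

    Sketch only: wrap with an autodiff library or finite differences to get
    gradients for Λ and Ψ.
    """
    N, D = X.shape
    C = Lambda @ Lambda.T + np.diag(psi_diag)        # model covariance ΛΛ^T + Ψ
    _, logdet = np.linalg.slogdet(2 * np.pi * C)     # log |2π C|
    quad = np.einsum('nd,de,ne->', X, np.linalg.inv(C), X) / N   # mean of x^T C^{-1} x
    return 0.5 * (logdet + quad)
```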

22 FA vs PCA
- PCA and pPCA are rotationally invariant; FA is not.
  If x → Ux for unitary U, then λ_(i)^PCA → Uλ_(i)^PCA.
- FA is measurement scale invariant; PCA and pPCA are not.
  If x → Sx for diagonal S, then λ_(i)^FA → Sλ_(i)^FA.
- FA and pPCA define a probabilistic model; PCA does not.

23 Limitations of Gaussian, FA and PCA models
Gaussian, FA and PCA models are easy to understand and use in practice. They are a convenient way to find interesting directions in very high dimensional data sets, e.g. as preprocessing.
Their problem is that they make very strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive.
[Scatter plot of x_{i2} against x_{i1}: the clustered data from before.]
By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.

24 Mixture Distributions
[Scatter plot of x_{i2} against x_{i1}.]
A mixture distribution has a single discrete latent variable:
    s_i ~ iid Discrete[π]
    x_i | s_i ~ P_{s_i}[θ_{s_i}]
Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.

25 The Mixture Likelihood
The mixture model is
    s_i ~ iid Discrete[π]
    x_i | s_i ~ P_{s_i}[θ_{s_i}]
Under the discrete distribution,
    P(s_i = m) = π_m;    π_m ≥ 0,  Σ_{m=1}^{k} π_m = 1
Thus, the probability (density) at a single data point x_i is
    P(x_i) = Σ_{m=1}^{k} P(x_i | s_i = m) P(s_i = m)
           = Σ_{m=1}^{k} π_m P_m(x_i; θ_m)
The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).
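A small sketch of evaluating a mixture density at a point, here with Gaussian components; the parameter values (means, covariances, weights) are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mog_density(x, pis, mus, Sigmas):
    """Mixture-of-Gaussians density: sum_m pi_m N(x; mu_m, Sigma_m)."""
    return sum(pi * multivariate_normal(mu, Sigma).pdf(x)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

# Example: a tri-valued mixture (parameters are illustrative, not from the slides)
pis = [1/3, 1/3, 1/3]
mus = [np.array([0., 0.]), np.array([3., 0.]), np.array([0., 3.])]
Sigmas = [np.eye(2)] * 3
print(mog_density(np.array([1., 1.]), pis, mus, Sigmas))
```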

26 Approximation with a Mixture of Gaussians (MoG)
The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions.
Here are examples where non-Gaussian densities are modelled (approximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density.
[Panels: Uniform, Triangle, Heavy tails.]
Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.

27 Clustering with a MoG

28 Clustering with a MoG
In clustering applications, the latent variable s_i represents the (unknown) identity of the cluster to which the i-th observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster:
    P(s_i = m | π) = π_m
Data from the m-th cluster are distributed according to the m-th component:
    P(x_i | s_i = m) = P_m(x_i)
Once we observe a data point, the posterior probability distribution for the cluster it belongs to is
    P(s_i = m | x_i) = P_m(x_i) π_m / Σ_{m'} P_{m'}(x_i) π_{m'}
This is often called the responsibility of the m-th cluster for the i-th data point.
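A sketch of computing the responsibilities r_{im} for Gaussian components, assuming the parameters are given (e.g. from a previous learning step); the function name is my own.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Posterior cluster probabilities r_{im} = P(s_i = m | x_i) for a MoG."""
    # unnormalised responsibilities: pi_m * N(x_i; mu_m, Sigma_m)
    R = np.column_stack([pi * multivariate_normal(mu, Sigma).pdf(X)
                         for pi, mu, Sigma in zip(pis, mus, Sigmas)])
    return R / R.sum(axis=1, keepdims=True)   # normalise over components m
```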

29 The MoG likelihood
Each component of a MoG is a Gaussian, with mean µ_m and covariance matrix Σ_m. Thus, the probability density evaluated at a set of n iid observations D = {x_1 ... x_n} (i.e. the likelihood) is
    p(D | {µ_m}, {Σ_m}, π) = ∏_{i=1}^{n} Σ_{m=1}^{k} π_m N(x_i | µ_m, Σ_m)
                           = ∏_{i=1}^{n} Σ_{m=1}^{k} π_m |2πΣ_m|^{-1/2} exp[−(1/2) (x_i − µ_m)^T Σ_m^{-1} (x_i − µ_m)]
The log of the likelihood is
    log p(D | {µ_m}, {Σ_m}, π) = Σ_{i=1}^{n} log Σ_{m=1}^{k} π_m |2πΣ_m|^{-1/2} exp[−(1/2) (x_i − µ_m)^T Σ_m^{-1} (x_i − µ_m)]
Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
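A sketch of evaluating this log likelihood numerically; the log-sum-exp trick is a standard stability device and an addition of mine, not something the slides discuss.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mog_log_likelihood(X, pis, mus, Sigmas):
    """log p(D | {mu_m}, {Sigma_m}, pi), computed stably with log-sum-exp."""
    # log_pm[i, m] = log pi_m + log N(x_i; mu_m, Sigma_m)
    log_pm = np.column_stack([np.log(pi) + multivariate_normal(mu, Sigma).logpdf(X)
                              for pi, mu, Sigma in zip(pis, mus, Sigmas)])
    return logsumexp(log_pm, axis=1).sum()   # sum over data points of log-mixture terms
```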

30 Maximum Likelihood for a Mixture Model
The log likelihood is
    L = Σ_{i=1}^{n} log Σ_{m=1}^{k} π_m P_m(x_i; θ_m)
Its partial derivative w.r.t. θ_m is
    ∂L/∂θ_m = Σ_{i=1}^{n} [π_m / Σ_{m'=1}^{k} π_{m'} P_{m'}(x_i; θ_{m'})] ∂P_m(x_i; θ_m)/∂θ_m
or, using ∂P/∂θ = P ∂log P/∂θ,
    ∂L/∂θ_m = Σ_{i=1}^{n} [π_m P_m(x_i; θ_m) / Σ_{m'=1}^{k} π_{m'} P_{m'}(x_i; θ_{m'})] ∂log P_m(x_i; θ_m)/∂θ_m
            = Σ_{i=1}^{n} r_im ∂log P_m(x_i; θ_m)/∂θ_m
where the bracketed term is the responsibility r_im. And its partial derivative w.r.t. π_m is
    ∂L/∂π_m = Σ_{i=1}^{n} P_m(x_i; θ_m) / Σ_{m'=1}^{k} π_{m'} P_{m'}(x_i; θ_{m'}) = Σ_{i=1}^{n} r_im / π_m

31 MoG Derivatives
For a MoG, with θ_m = {µ_m, Σ_m}, we get
    ∂L/∂µ_m = Σ_{i=1}^{n} r_im Σ_m^{-1} (x_i − µ_m)
    ∂L/∂Σ_m^{-1} = (1/2) Σ_{i=1}^{n} r_im (Σ_m − (x_i − µ_m)(x_i − µ_m)^T)
These equations can be used (along with the derivatives w.r.t. π_m) for gradient-based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).

32 The K-means Algorithm
The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis).
Take π_m = 1/k and Σ_m = σ²I, with σ² → 0. Then the responsibilities become binary,
    r_im → δ(m, argmin_l ||x_i − µ_l||²)
with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0.
The k-means algorithm iterates these two steps:
- Assign each point to its closest mean (set r_im = δ(m, argmin_l ||x_i − µ_l||²)).
- Update the means to the average of their assigned points (set µ_m = Σ_i r_im x_i / Σ_i r_im).
This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µ_m. The algorithm has no learning rate, but the assumptions are quite limiting.
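A minimal k-means sketch implementing exactly these two alternating steps; the initialisation by random data points, the fixed iteration count and the empty-cluster handling are choices of mine, not specified by the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate hard assignments and mean updates."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial means
    for _ in range(n_iter):
        # assignment step: closest mean for each point (binary responsibilities)
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # (N, k) squared distances
        z = d2.argmin(axis=1)
        # update step: each mean becomes the average of its assigned points
        for m in range(k):
            if np.any(z == m):                # keep the old mean if a cluster empties
                mus[m] = X[z == m].mean(axis=0)
    return mus, z
```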

33 A preview of the EM algorithm
We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µ_m and Σ_m.
The EM algorithm for MoGs iterates these two steps:
- Evaluate the responsibilities for each point given the current parameters.
- Optimise the parameters assuming the responsibilities stay fixed:
    µ_m = Σ_i r_im x_i / Σ_i r_im    and    Σ_m = Σ_i r_im (x_i − µ_m)(x_i − µ_m)^T / Σ_i r_im
Although this appears ad hoc, we will see (later) that it is a special case of a general algorithm, and is actually guaranteed to increase the likelihood at each iteration.
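A sketch of these two steps as code for a MoG; the random initialisation, the small covariance regulariser and the fixed iteration count are my own assumptions (the slides do not specify them), and the π_m update is included for completeness.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, n_iter=50, seed=0):
    """EM for a mixture of Gaussians, iterating the two steps above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(N, size=k, replace=False)].astype(float)
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(k)])
    for _ in range(n_iter):
        # E step: responsibilities r_im given the current parameters
        R = np.column_stack([pi * multivariate_normal(mu, S).pdf(X)
                             for pi, mu, S in zip(pis, mus, Sigmas)])
        R /= R.sum(axis=1, keepdims=True)
        # M step: update parameters with the responsibilities held fixed
        Nm = R.sum(axis=0)
        pis = Nm / N
        mus = (R.T @ X) / Nm[:, None]
        for m in range(k):
            Xc = X - mus[m]
            Sigmas[m] = (R[:, m, None] * Xc).T @ Xc / Nm[m] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas
```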

34 Issues
There are several problems with the new algorithms:
- slow convergence for the gradient-based method;
- the gradient-based method may develop invalid covariance matrices;
- local minima: the end configuration may depend on the starting state;
- how do you adjust k? Using the likelihood alone is no good;
- singularities: components with a single data point will have their covariance going to zero and the likelihood will tend to infinity.
