Gray-box Inference for Structured Gaussian Process Models: Supplementary Material


Pietro Galliani, Amir Dezfouli, Edwin V. Bonilla and Novi Quadrianto

February 8, 2017

1 Proof of Theorem 2

Here we prove the result that we can estimate the expected log likelihood and its gradients using expectations over low-dimensional Gaussians, that is:

Theorem 2. For the structured GP model defined in our paper, the expected log likelihood over the given variational distribution and its gradients can be estimated using expectations over $T_n$-dimensional Gaussians and $V$-dimensional Gaussians, where $T_n$ is the length of each sequence and $V$ is the vocabulary size.

1.1 Estimation of $\mathcal{L}_{\text{ell}}$ in the full (non-sparse) model

For $\mathcal{L}_{\text{ell}}$ we have that:

$$\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(y_n \mid f_n) \right\rangle_{q(f^{\text{un}})\, q(f^{\text{bin}})} \qquad (1)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \int_{f^{\text{bin}}} \int_{f^{\text{un}}} q(f^{\text{un}})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, df^{\text{un}}\, df^{\text{bin}} \qquad (2)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \int_{f^{\text{bin}}} \int_{f^{\text{un}}_{n}} \int_{f^{\text{un}}_{\setminus n}} q(f^{\text{un}}_{\setminus n})\, q(f^{\text{un}}_{n})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, df^{\text{un}}_{\setminus n}\, df^{\text{un}}_{n}\, df^{\text{bin}} \qquad (3)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(y_n \mid f_n) \right\rangle_{q(f^{\text{un}}_{n})\, q(f^{\text{bin}})} \qquad (4)$$

$$= \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \left\langle \log p(y_n \mid f_n) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}, \qquad (5)$$

where $q_{kn}(f^{\text{un}}_{n})$ is a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance $\Sigma_k^{(n)}$, each block of size $T_n \times T_n$. Therefore, we can estimate the above term by sampling from $T_n$-dimensional Gaussians independently. Furthermore, $q(f^{\text{bin}})$ is a $V$-dimensional Gaussian, which can also be sampled independently. In practice, we can assume that the covariance of $q(f^{\text{bin}})$ is diagonal, and we only sample from univariate Gaussians for the pairwise functions.
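The estimator in Equation (5) is straightforward to implement. The following NumPy sketch shows one way to compute the $n$-th term by sampling each $T_n$-dimensional block and the pairwise functions independently; it is illustrative only, and the callback `log_lik`, which evaluates $\log p(y_n \mid f_n)$ for a given sample, as well as all variable names, are our own assumptions rather than part of the paper's code.

```python
import numpy as np

def sample_block_diag_gaussian(block_means, block_covs, rng):
    """Draw one sample of f_n^un by sampling each T_n x T_n block
    (one block per unary latent function j) independently."""
    return np.concatenate(
        [rng.multivariate_normal(m, C) for m, C in zip(block_means, block_covs)])

def expected_log_lik_term(y_n, pi, unary_means, unary_covs, bin_mean, bin_var,
                          log_lik, num_samples=100, seed=0):
    """Monte Carlo estimate of the n-th term of Equation (5):
    sum_k pi_k E_{q_kn(f_un) q(f_bin)} [ log p(y_n | f_n) ].

    unary_means[k][j], unary_covs[k][j]: mean and covariance of the T_n-dim
    block of q_kn for mixture component k and unary function j.
    bin_mean, bin_var: mean and diagonal variance of q(f_bin).
    log_lik(y_n, f_un, f_bin): evaluates log p(y_n | f_n) for one sample."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for k, pi_k in enumerate(pi):
        acc = 0.0
        for _ in range(num_samples):
            f_un = sample_block_diag_gaussian(unary_means[k], unary_covs[k], rng)
            f_bin = bin_mean + np.sqrt(bin_var) * rng.standard_normal(bin_mean.shape)
            acc += log_lik(y_n, f_un, f_bin)
        total += pi_k * acc / num_samples
    return total
```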

1.2 Gradients

Taking the gradients of the $k$-th term for the $n$-th sequence in $\mathcal{L}_{\text{ell}}$:

$$\mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(y_n \mid f_n) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})} \qquad (6)$$

$$\nabla_{\lambda^{\text{un}}_{k}} \mathcal{L}_{\text{ell}}^{(k,n)} = \nabla_{\lambda^{\text{un}}_{k}} \int_{f^{\text{bin}}} \int_{f^{\text{un}}_{n}} q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, df^{\text{un}}_{n}\, df^{\text{bin}} \qquad (7)$$

$$= \int_{f^{\text{bin}}} \int_{f^{\text{un}}_{n}} q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})\, \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n})\, df^{\text{un}}_{n}\, df^{\text{bin}} \qquad (8)$$

$$= \left\langle \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n}) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}, \qquad (9)$$

where we have used the fact that $\nabla_x f(x) = f(x)\, \nabla_x \log f(x)$ for any non-negative function $f(x)$. Similarly, the gradients with respect to the parameters of the distribution over binary functions can be estimated using:

$$\nabla_{\lambda^{\text{bin}}} \mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{bin}}} \log q(f^{\text{bin}}) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}. \qquad (10)$$

2 KL terms in the sparse model

The KL term ($\mathcal{L}_{\text{kl}}$) in the variational objective ($\mathcal{L}_{\text{elbo}}$) is composed of a KL divergence between the approximate posteriors and the priors over the inducing variables and pairwise functions:

$$\mathcal{L}_{\text{kl}} = \underbrace{-\operatorname{KL}\left(q(u)\,\|\,p(u)\right)}_{\mathcal{L}_{\text{kl}}^{\text{un}}} \;\underbrace{-\operatorname{KL}\left(q(f^{\text{bin}})\,\|\,p(f^{\text{bin}})\right)}_{\mathcal{L}_{\text{kl}}^{\text{bin}}}, \qquad (11)$$

where, as the approximate posterior and the prior over the pairwise functions are Gaussian, the KL over pairwise functions can be computed analytically:

$$\mathcal{L}_{\text{kl}}^{\text{bin}} = -\operatorname{KL}\left(q(f^{\text{bin}})\,\|\,p(f^{\text{bin}})\right) = -\operatorname{KL}\left(\mathcal{N}(f^{\text{bin}}; m^{\text{bin}}, S^{\text{bin}})\,\|\,\mathcal{N}(f^{\text{bin}}; 0, K^{\text{bin}})\right) \qquad (12)$$

$$= -\tfrac{1}{2}\left( \log |K^{\text{bin}}| - \log |S^{\text{bin}}| + (m^{\text{bin}})^T (K^{\text{bin}})^{-1} m^{\text{bin}} + \operatorname{tr}\left((K^{\text{bin}})^{-1} S^{\text{bin}}\right) - V \right). \qquad (13)$$

For the distributions over the unary functions we need to compute a KL divergence between a mixture of Gaussians and a Gaussian. For this we consider the decomposition of the KL divergence as follows:

$$\mathcal{L}_{\text{kl}}^{\text{un}} = -\operatorname{KL}\left(q(u)\,\|\,p(u)\right) = \underbrace{\mathbb{E}_q\left[-\log q(u)\right]}_{\mathcal{L}_{\text{ent}}} + \underbrace{\mathbb{E}_q\left[\log p(u)\right]}_{\mathcal{L}_{\text{cross}}}, \qquad (14)$$

where the entropy term ($\mathcal{L}_{\text{ent}}$) can be lower bounded using Jensen's inequality:

$$\mathcal{L}_{\text{ent}} \geq -\sum_{k=1}^{K} \pi_k \log \sum_{l=1}^{K} \pi_l\, \mathcal{N}(m_k; m_l, S_k + S_l) \triangleq \hat{\mathcal{L}}_{\text{ent}}, \qquad (15)$$

and the negative cross-entropy term ($\mathcal{L}_{\text{cross}}$) can be computed exactly:

$$\mathcal{L}_{\text{cross}} = -\frac{1}{2} \sum_{k=1}^{K} \pi_k \sum_{j=1}^{V} \left[ M \log 2\pi + \log |\kappa(Z_j, Z_j)| + m_{kj}^T \kappa(Z_j, Z_j)^{-1} m_{kj} + \operatorname{tr}\left(\kappa(Z_j, Z_j)^{-1} S_{kj}\right) \right]. \qquad (16)$$
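For concreteness, the two analytic quantities above can be computed as follows. This is a minimal NumPy/SciPy sketch of Equations (13) and (15), written for clarity rather than efficiency; the function and argument names are ours, not the paper's.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import multivariate_normal

def neg_kl_gaussian_prior(m, S, K):
    """-KL( N(m, S) || N(0, K) ) as in Equation (13)."""
    d = m.shape[0]
    cK = cho_factor(K)
    logdet_K = 2.0 * np.sum(np.log(np.diag(cK[0])))
    logdet_S = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(S))))
    quad = m @ cho_solve(cK, m)          # m^T K^{-1} m
    trace = np.trace(cho_solve(cK, S))   # tr(K^{-1} S)
    return -0.5 * (logdet_K - logdet_S + quad + trace - d)

def entropy_lower_bound(pi, means, covs):
    """Jensen lower bound on the mixture entropy, Equation (15)."""
    K = len(pi)
    bound = 0.0
    for k in range(K):
        z_k = sum(pi[l] * multivariate_normal.pdf(means[k], mean=means[l],
                                                  cov=covs[k] + covs[l])
                  for l in range(K))
        bound -= pi[k] * np.log(z_k)
    return bound
```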

3 Proof of Theorem 3

To prove Theorem 3 we will express the expected log likelihood term in the same form as that given in Equation (5), showing that the resulting $q_{kn}(f^{\text{un}}_{n})$ is also a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance, having $V$ blocks, each of dimensions $T_n \times T_n$. We start by taking the given $\mathcal{L}_{\text{ell}}$, where the expectations are over the joint posterior $q(f, u \mid \lambda) = p(f^{\text{un}} \mid u)\, q(u)\, q(f^{\text{bin}})$:

$$\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(y_n \mid f_n) \right\rangle_{p(f^{\text{un}} \mid u)\, q(u)\, q(f^{\text{bin}})} \qquad (17)$$

$$= \int_{f} \log p(y \mid f)\, \underbrace{\int_{u} q(u)\, p(f^{\text{un}} \mid u)\, du}_{q(f^{\text{un}})}\; q(f^{\text{bin}})\, df, \qquad (18)$$

where our approximating distribution is:

$$q(f) = q(f^{\text{un}})\, q(f^{\text{bin}}), \qquad (19)$$

$$q(f^{\text{un}}) = \int_{u} q(u)\, p(f^{\text{un}} \mid u)\, du, \qquad (20)$$

which can be computed analytically:

$$q(f^{\text{un}}) = \sum_{k=1}^{K} \pi_k\, q_k(f^{\text{un}}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{V} \mathcal{N}(f^{\text{un}}_{j}; b_{kj}, \Sigma_{kj}), \qquad (21)$$

where

$$b_{kj} = A_j m_{kj}, \qquad (22)$$

$$\Sigma_{kj} = \widetilde{K}_j + A_j S_{kj} A_j^T. \qquad (23)$$

We note in Equation (21) that $q_k(f^{\text{un}})$ has a block-diagonal structure, which implies that we have the same expression for $\mathcal{L}_{\text{ell}}$ as in Equation (5). Therefore, we obtain analogous estimates:

$$\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \left\langle \log p(y_n \mid f_n) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}. \qquad (24)$$

Here, as before, $q_{kn}(f^{\text{un}}_{n})$ is a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance $\Sigma_k^{(n)}$, each block of size $T_n \times T_n$. The main difference in this (sparse) case is that $b_k^{(n)}$ and $\Sigma_k^{(n)}$ are constrained by the expressions in Equations (22) and (23). Hence, the proof for the gradients follows the same derivation as in Section 1.2 above.
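The quantities in Equations (22) and (23) are the usual sparse-GP predictive equations applied per latent function and per mixture component. Below is a minimal NumPy sketch, assuming the kernel matrices $\kappa_j(X, X)$, $\kappa(X, Z_j)$ and $\kappa(Z_j, Z_j)$ have been precomputed; the jitter term is a standard numerical safeguard and is not part of the equations.

```python
import numpy as np

def sparse_marginal(Kxx, Kxz, Kzz, m_kj, S_kj, jitter=1e-6):
    """Marginal mean and covariance of q_k(f_j^un) from Equations (22)-(23):
    b = A m,   Sigma = Kxx - A Kzx + A S A^T,   with A = Kxz Kzz^{-1}."""
    M = Kzz.shape[0]
    L = np.linalg.cholesky(Kzz + jitter * np.eye(M))
    # A = Kxz Kzz^{-1}, via two solves with the Cholesky factor
    A = np.linalg.solve(L.T, np.linalg.solve(L, Kxz.T)).T
    b = A @ m_kj
    Sigma = Kxx - A @ Kxz.T + A @ S_kj @ A.T
    return b, Sigma
```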

4 Gradients of $\mathcal{L}_{\text{elbo}}$ for the sparse model

Here we give the gradients of the variational objective with respect to the parameters of the variational distributions over the inducing variables, the pairwise functions, and the hyper-parameters.

4.1 Inducing variables

KL term

As the structured likelihood does not affect the KL divergence term, the gradients corresponding to this term are similar to those in the non-structured case (Dezfouli and Bonilla, 2015). Let $K_{zz}$ be the block-diagonal covariance with $V$ blocks $\kappa(Z_j, Z_j)$, $j = 1, \ldots, V$. Additionally, let us assume the following definitions:

$$C_{kl} = S_k + S_l, \qquad (25)$$

$$N_{kl} = \mathcal{N}(m_k; m_l, C_{kl}), \qquad (26)$$

$$z_k = \sum_{l=1}^{K} \pi_l N_{kl}. \qquad (27)$$

The gradients of $\mathcal{L}_{\text{cross}}$ with respect to the posterior mean and posterior covariance for component $k$ are:

$$\nabla_{m_k} \mathcal{L}_{\text{cross}} = -\pi_k K_{zz}^{-1} m_k, \qquad (28)$$

$$\nabla_{S_k} \mathcal{L}_{\text{cross}} = -\tfrac{1}{2} \pi_k K_{zz}^{-1}, \qquad (29)$$

$$\nabla_{\pi_k} \mathcal{L}_{\text{cross}} = -\frac{1}{2} \sum_{j=1}^{V} \left[ M \log 2\pi + \log |\kappa(Z_j, Z_j)| + m_{kj}^T \kappa(Z_j, Z_j)^{-1} m_{kj} + \operatorname{tr}\left(\kappa(Z_j, Z_j)^{-1} S_{kj}\right) \right], \qquad (30)$$

where we note that we compute $K_{zz}^{-1}$ by inverting the corresponding blocks $\kappa(Z_j, Z_j)$ independently. The gradients of the entropy term with respect to the variational parameters are:

$$\nabla_{m_k} \hat{\mathcal{L}}_{\text{ent}} = \pi_k \sum_{l=1}^{K} \pi_l \left( \frac{N_{kl}}{z_k} + \frac{N_{kl}}{z_l} \right) C_{kl}^{-1} (m_k - m_l), \qquad (31)$$

$$\nabla_{S_k} \hat{\mathcal{L}}_{\text{ent}} = \frac{1}{2} \pi_k \sum_{l=1}^{K} \pi_l \left( \frac{N_{kl}}{z_k} + \frac{N_{kl}}{z_l} \right) \left[ C_{kl}^{-1} - C_{kl}^{-1} (m_k - m_l)(m_k - m_l)^T C_{kl}^{-1} \right], \qquad (32)$$

$$\nabla_{\pi_k} \hat{\mathcal{L}}_{\text{ent}} = -\log z_k - \sum_{l=1}^{K} \pi_l \frac{N_{kl}}{z_l}.$$
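The entropy-bound gradients above translate directly into code. The following NumPy sketch implements Equations (31) and (32) as written, favouring readability over efficiency; it is our illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def entropy_bound_grads(pi, means, covs):
    """Gradients of the entropy lower bound w.r.t. the component means and
    covariances, Equations (31)-(32)."""
    pi = np.asarray(pi)
    K = len(pi)
    N = np.zeros((K, K))
    Cinv = [[None] * K for _ in range(K)]
    for k in range(K):
        for l in range(K):
            C = covs[k] + covs[l]
            Cinv[k][l] = np.linalg.inv(C)
            N[k, l] = multivariate_normal.pdf(means[k], mean=means[l], cov=C)
    z = N @ pi                                    # z_k = sum_l pi_l N_kl
    grad_m, grad_S = [], []
    for k in range(K):
        gm = np.zeros_like(means[k])
        gS = np.zeros_like(covs[k])
        for l in range(K):
            w = pi[k] * pi[l] * (N[k, l] / z[k] + N[k, l] / z[l])
            diff = Cinv[k][l] @ (means[k] - means[l])
            gm += w * diff
            gS += 0.5 * w * (Cinv[k][l] - np.outer(diff, diff))
        grad_m.append(gm)
        grad_S.append(gS)
    return grad_m, grad_S
```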

Expected log likelihood term

Retaking the gradients in the full model in Equation (9), we have that:

$$\nabla_{\lambda^{\text{un}}_{k}} \mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(y_n \mid f_n)\, \nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n}) \right\rangle_{q_{kn}(f^{\text{un}}_{n})\, q(f^{\text{bin}})}, \qquad (33)$$

where the variational parameters $\lambda^{\text{un}}_{k}$ are the posterior means and covariances ($\{m_{kj}\}$ and $\{S_{kj}\}$) of the inducing variables. As given in Equation (21), $q_k(f^{\text{un}})$ factorizes over the latent processes ($j = 1, \ldots, V$), and so do the marginals $q_{kn}(f^{\text{un}}_{n})$, hence:

$$\nabla_{\lambda^{\text{un}}_{k}} \log q_{kn}(f^{\text{un}}_{n}) = \nabla_{\lambda^{\text{un}}_{k}} \sum_{j=1}^{V} \log \mathcal{N}(f^{\text{un}}_{nj}; b_{kjn}, \Sigma_{kjn}), \qquad (34)$$

where each of the distributions in Equation (34) is a $T_n$-dimensional Gaussian. Let us assume the following definitions:

$$X_n: \text{all feature vectors corresponding to sequence } n, \qquad (35)$$

$$A_{jn} = \kappa(X_n, Z_j)\, \kappa(Z_j, Z_j)^{-1}, \qquad (36)$$

$$\widetilde{K}^{n}_{j} = \kappa_j(X_n, X_n) - A_{jn}\, \kappa(Z_j, X_n), \qquad (37)$$

therefore:

$$b_{kjn} = A_{jn} m_{kj}, \qquad (38)$$

$$\Sigma_{kjn} = \widetilde{K}^{n}_{j} + A_{jn} S_{kj} A_{jn}^T. \qquad (39)$$

Hence, the gradients of $\log q_{kn}(f^{\text{un}}_{n})$ with respect to the variational parameters of the unary posterior distributions over the inducing points are:

$$\nabla_{m_{kj}} \log q_{kn}(f^{\text{un}}_{n}) = A_{jn}^T \Sigma_{kjn}^{-1} \left( f^{\text{un}}_{nj} - b_{kjn} \right), \qquad (40)$$

$$\nabla_{S_{kj}} \log q_{kn}(f^{\text{un}}_{n}) = \tfrac{1}{2} A_{jn}^T \left[ \Sigma_{kjn}^{-1} (f^{\text{un}}_{nj} - b_{kjn})(f^{\text{un}}_{nj} - b_{kjn})^T \Sigma_{kjn}^{-1} - \Sigma_{kjn}^{-1} \right] A_{jn}. \qquad (41)$$

Therefore, the gradients of $\mathcal{L}_{\text{ell}}$ with respect to the parameters of the distributions over unary functions are:

$$\nabla_{m_{kj}} \mathcal{L}_{\text{ell}} = \frac{\pi_k}{S} \sum_{n=1}^{N_{\text{seq}}} \kappa(Z_j, Z_j)^{-1} \kappa(Z_j, X_n)\, \Sigma_{kjn}^{-1} \sum_{i=1}^{S} \left( f^{\text{un}}_{nij} - b_{kjn} \right) \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}), \qquad (42)$$

$$\nabla_{S_{kj}} \mathcal{L}_{\text{ell}} = \frac{\pi_k}{2S} \sum_{n=1}^{N_{\text{seq}}} A_{jn}^T \left\{ \sum_{i=1}^{S} \left[ \Sigma_{kjn}^{-1} (f^{\text{un}}_{nij} - b_{kjn})(f^{\text{un}}_{nij} - b_{kjn})^T \Sigma_{kjn}^{-1} - \Sigma_{kjn}^{-1} \right] \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}) \right\} A_{jn}, \qquad (43)$$

where $S$ is the number of samples.

4.2 Pairwise functions

The gradients of $\mathcal{L}_{\text{kl}}^{\text{bin}}$ with respect to the parameters of the posterior over pairwise functions are given by:

$$\nabla_{m^{\text{bin}}} \mathcal{L}_{\text{kl}}^{\text{bin}} = -(K^{\text{bin}})^{-1} m^{\text{bin}}, \qquad (44)$$

$$\nabla_{S^{\text{bin}}} \mathcal{L}_{\text{kl}}^{\text{bin}} = \tfrac{1}{2} \left( (S^{\text{bin}})^{-1} - (K^{\text{bin}})^{-1} \right). \qquad (45)$$

The gradients of $\mathcal{L}_{\text{ell}}$ with respect to the parameters of the posterior over pairwise functions are given by:

$$\nabla_{m^{\text{bin}}} \mathcal{L}_{\text{ell}} = \frac{1}{S} \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \sum_{i=1}^{S} (S^{\text{bin}})^{-1} \left( f^{\text{bin}}_{i} - m^{\text{bin}} \right) \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}), \qquad (46)$$

$$\nabla_{S^{\text{bin}}} \mathcal{L}_{\text{ell}} = \frac{1}{2S} \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \sum_{i=1}^{S} \left[ (S^{\text{bin}})^{-1} (f^{\text{bin}}_{i} - m^{\text{bin}})(f^{\text{bin}}_{i} - m^{\text{bin}})^T (S^{\text{bin}})^{-1} - (S^{\text{bin}})^{-1} \right] \log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}). \qquad (47)$$

5 Piecewise pseudolikelihood

Piecewise pseudolikelihood (Sutton and McCallum, 2007) approximates the likelihood $p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i})$ of sequence $n$ given the latent functions $f^{\text{un}}_{ni}$ and $f^{\text{bin}}_{i}$ by computing the product, for every single factor and every variable occurring in it, of the conditional probability of the variable given its neighbours with respect to that factor.¹ In our linear model, this yields the following expression for the log pseudolikelihood:

$$\log p(y_n \mid f^{\text{un}}_{ni}, f^{\text{bin}}_{i}) = \sum_{w=1}^{W_n} \log p\left((y_n)_w \mid f^{\text{un}}_{ni}\right) + \sum_{w=1}^{W_n - 1} \log p\left((y_n)_w \mid (y_n)_{w+1}, f^{\text{bin}}_{i}\right) + \sum_{w=1}^{W_n - 1} \log p\left((y_n)_{w+1} \mid (y_n)_w, f^{\text{bin}}_{i}\right),$$

where $W_n$ is the number of words in sentence $n$ and

$$p\left((y_n)_w \mid f^{\text{un}}_{ni}\right) \propto \exp\left( f^{\text{un}}_{ni,\, y_n(w)} \right); \qquad (48)$$

$$p\left((y_n)_w \mid (y_n)_{w+1}, f^{\text{bin}}_{i}\right) \propto \exp\left( f^{\text{bin}}_{i}\left((y_n)_w, (y_n)_{w+1}\right) \right); \qquad (49)$$

$$p\left((y_n)_{w+1} \mid (y_n)_w, f^{\text{bin}}_{i}\right) \propto \exp\left( f^{\text{bin}}_{i}\left((y_n)_w, (y_n)_{w+1}\right) \right). \qquad (50)$$

¹ Here $i$ represents the index of the specific samples $f^{\text{un}}_{ni}$ and $f^{\text{bin}}_{i}$ taken from our distributions $q_{kn}(f^{\text{un}}_{n})$ and $q(f^{\text{bin}})$.
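A small sketch of the log piecewise pseudolikelihood in Equations (48)-(50) follows, under the assumption that each conditional is normalised over the label of the conditioned variable; the array layout (one unary value per word-label pair, one pairwise value per label pair) is an illustrative choice of ours.

```python
import numpy as np
from scipy.special import logsumexp

def log_piecewise_pseudolikelihood(y, f_un, f_bin):
    """Log piecewise pseudolikelihood of a label sequence y, Equations (48)-(50).

    y:     integer labels, length W_n.
    f_un:  (W_n, V) array; f_un[w, v] is the unary value for word w and label v.
    f_bin: (V, V) array; f_bin[v, u] is the pairwise value for the label pair (v, u)."""
    W = len(y)
    # Unary terms: log p(y_w | f_un) = f_un[w, y_w] - logsumexp_v f_un[w, v]
    ll = sum(f_un[w, y[w]] - logsumexp(f_un[w]) for w in range(W))
    for w in range(W - 1):
        s = f_bin[y[w], y[w + 1]]
        # p(y_w | y_{w+1}, f_bin): normalise over the left label
        ll += s - logsumexp(f_bin[:, y[w + 1]])
        # p(y_{w+1} | y_w, f_bin): normalise over the right label
        ll += s - logsumexp(f_bin[y[w], :])
    return ll
```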

6 Experiments

6.1 Experimental set-up

Before starting all experiments, our program selected the positions of the 500 inducing points by performing K-means clustering on the training data. The initial values of the means (one mean value for every inducing point $p$ and for every possible label $l$) are set as the fraction of the points of the training set in the cluster whose centroid is $p$ that belong to label $l$.

As described in Algorithm 1, we adaptively choose the learning rates for all parameters by searching (beginning from an initial guess, and doubling or dividing it by two as needed) for the biggest step size that causes the objective to decrease over twenty stochastic optimization steps. The step size to use for the given parameter is taken as one fourth of this value, to avoid instability. The initial step sizes were 0.5 for the unary mean parameters and 1.0 for the (linear) kernel hyperparameters, with their own initial guesses for the binary mean, unary covariance and binary covariance parameters (these values were selected only to speed up, insofar as possible, the process of parameter search); the optimal step sizes were found in the order: unary mean, unary covariance, binary mean, binary covariance, kernel hyperparameter. Of course this is not an exhaustive grid search, but it is less computationally expensive and works well in practice.
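The initialisation described above can be reproduced with standard tools. Here is a short sketch using scikit-learn's KMeans as a stand-in for the clustering step; the helper name and array conventions are ours, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_inducing_points(X_train, y_train, num_labels, num_inducing=500, seed=0):
    """Place the inducing inputs at the K-means centroids of the training
    features, and set the initial mean for inducing point p and label l to the
    fraction of points in p's cluster that carry label l (Section 6.1)."""
    km = KMeans(n_clusters=num_inducing, n_init=10, random_state=seed).fit(X_train)
    Z = km.cluster_centers_                      # inducing inputs
    m0 = np.zeros((num_inducing, num_labels))    # initial unary mean parameters
    for p in range(num_inducing):
        cluster_labels = y_train[km.labels_ == p]
        if cluster_labels.size:
            m0[p] = np.bincount(cluster_labels, minlength=num_labels) / cluster_labels.size
    return Z, m0
```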
6.2 Optimization

We then optimize the three sets of parameters (unaries, binaries, hyperparameters) in a global loop, in the same order as mentioned earlier, through standard Stochastic Gradient Descent, until the time limit is reached, using 4000 new random samples each step for the estimation of the relevant gradients and averages.² For the small-scale experiments, the variational parameters for the unary nodes are optimized for 500 iterations, the variational parameters for the pairwise nodes are optimized for 100 iterations, and the hyper-parameters are updated for 20 iterations.

We keep a tight bound (10 in the unary case, 20 in the binary one) on the maximum possible absolute values that the covariance parameters can take. If an update would bring their value beyond it, we recompute the gradients with new samples: oftentimes this resolves the issue (in brief, because the faulty update was due to a bad estimation of the gradients). If the problem persists, we disregard the current sentence and move on to the next one; and if after ten attempts the problem still persists, we move back to the latest safe position that did not cause out-of-bound errors. In this way, we can use a relatively small number of samples and maintain rather aggressive step sizes while recovering neatly from out-of-bounds errors.

The setting for the large-scale experiments was similar, except that the optimization schedule was (1500, 500, 500) (that is, 1500 unary optimization steps, 500 binary ones and 500 hyperparameter ones) for the big base np and chunking experiments. All the experiments were run for four hours, not counting initial clustering or final prediction (but counting step size selection). For comparison, the (small-scale) gp-ess experiments, which were run for 50,000 elliptical slice sampling steps, took on average 1.19 hours for base np and 2.07 hours for segmentation, with the chunking and japanese ne runs likewise measured in hours.

6.3 Performance profiles

Figure 1 shows the performance of our algorithm as a function of time. We see that the test likelihood decreases very regularly in all the folds, and so, overall, does the error rate, albeit with more variability. The bulk of the optimization, both with respect to the test likelihood and with respect to the error rate, occurs during the first 10 minutes. This suggests that the kind of approach described in this paper might be particularly suited to cases in which the expected log likelihood of the prediction and speed of convergence are priorities.

² When computing the gradients of the kernel hyperparameters, 8000 samples were used instead, to ensure greater stability.

Algorithm 1 Method for computing the step size

function check_stepsize(parameters, step_size, num_to_check = 20)
    to_check <- num_to_check sentences drawn at random from the training set
    old_obj <- current value of the objective function
    for i = 0, ..., num_to_check do
        grad <- gradient of the objective with respect to parameters
        parameters <- parameters - step_size * grad
        if parameters are out of bounds then
            return False
    new_obj <- current value of the objective function
    if new_obj < old_obj then
        return True
    else
        return False

function search_stepsize_up(parameters, initial_step_size, factor)
    step_size <- initial_step_size
    old_params <- current parameter values of the model
    while True do
        is_good <- check_stepsize(parameters, step_size)
        if is_good = False then
            parameters <- old_params
            return step_size / factor
        step_size <- step_size * factor

function search_stepsize_down(parameters, initial_step_size, factor)
    step_size <- initial_step_size
    old_params <- current parameter values of the model
    while True do
        is_good <- check_stepsize(parameters, step_size)
        if is_good = True then
            parameters <- old_params
            return step_size * factor
        step_size <- step_size / factor

function choose_stepsize(parameters, initial_step_size, factor = 2)
    step_size <- search_stepsize_up(parameters, initial_step_size, factor)
    if step_size = initial_step_size / factor then
        step_size <- search_stepsize_down(parameters, initial_step_size, factor)
    return step_size / factor
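For reference, the same search can be written compactly in Python. This sketch folds the parameter bookkeeping into a single `check(step_size)` callback that runs a few stochastic steps, reports whether the objective decreased, and restores the parameters afterwards; the `max_iters` guard is an addition of ours and not part of Algorithm 1.

```python
def choose_stepsize(check, initial_step_size, factor=2.0, max_iters=30):
    """Compact sketch of the step-size search in Algorithm 1.

    check(step_size) runs a few stochastic optimization steps with the given
    step size, returns True iff the objective decreased, and restores the
    parameters afterwards.  Returns a conservative step size: the largest
    value that passed the check, divided by `factor`."""
    step = initial_step_size
    if check(step):
        # Search upwards: keep doubling while the objective still decreases.
        for _ in range(max_iters):
            if not check(step * factor):
                break
            step *= factor
    else:
        # Search downwards: keep halving until the objective decreases.
        for _ in range(max_iters):
            step /= factor
            if check(step):
                break
    return step / factor
```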

Figure 1: The test performance of gp-var-t on chunking for the large-scale experiment as a function of time.

References

Amir Dezfouli and Edwin V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In NIPS, 2015.

Charles Sutton and Andrew McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. In ICML, 2007.
