Gray-box Inference for Structured Gaussian Process Models: Supplementary Material
Pietro Galliani, Amir Dezfouli, Edwin V. Bonilla and Novi Quadrianto

February 8, 2017

1 Proof of Theorem 2

Here we prove the result that we can estimate the expected log likelihood and its gradients using expectations over low-dimensional Gaussians, that is:

Theorem 2. For the structured GP model defined in our paper, the expected log likelihood over the given variational distribution and its gradients can be estimated using expectations over $T_n$-dimensional Gaussians and $V^2$-dimensional Gaussians, where $T_n$ is the length of each sequence and $V$ is the vocabulary size.

1.1 Estimation of $\mathcal{L}_{\text{ell}}$ in the full (non-sparse) model

For $\mathcal{L}_{\text{ell}}$ we have that:
\begin{align}
\mathcal{L}_{\text{ell}} &= \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n) \right\rangle_{q(\mathbf{f}^{\text{un}})\, q(\mathbf{f}^{\text{bin}})} \tag{1} \\
&= \sum_{n=1}^{N_{\text{seq}}} \int_{\mathbf{f}^{\text{bin}}} \int_{\mathbf{f}^{\text{un}}} q(\mathbf{f}^{\text{un}})\, q(\mathbf{f}^{\text{bin}}) \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, d\mathbf{f}^{\text{un}}\, d\mathbf{f}^{\text{bin}} \tag{2} \\
&= \sum_{n=1}^{N_{\text{seq}}} \int_{\mathbf{f}^{\text{bin}}} \int_{\mathbf{f}^{\text{un}}_n} \int_{\mathbf{f}^{\text{un}}_{\setminus n}} q(\mathbf{f}^{\text{un}}_{\setminus n} \mid \mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}}) \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, d\mathbf{f}^{\text{un}}_{\setminus n}\, d\mathbf{f}^{\text{un}}_n\, d\mathbf{f}^{\text{bin}} \tag{3} \\
&= \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n) \right\rangle_{q(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})} \tag{4} \\
&= \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n) \right\rangle_{q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})}, \tag{5}
\end{align}
where $q_{k(n)}(\mathbf{f}^{\text{un}}_n)$ is a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance $\boldsymbol{\Sigma}_k^{(n)}$, each block of size $T_n \times T_n$. Therefore, we can estimate the above term by sampling from $T_n$-dimensional Gaussians independently. Furthermore, $q(\mathbf{f}^{\text{bin}})$ is a $V^2$-dimensional Gaussian, which can also be sampled independently. In practice, we can assume that the covariance of $q(\mathbf{f}^{\text{bin}})$ is diagonal, so we only sample from univariate Gaussians for the pairwise functions.
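To make the estimator in Equation (5) concrete, the following is a minimal NumPy sketch (ours, not the authors' code) of the sampling scheme the theorem licenses: each unary block is drawn as an independent $T_n$-dimensional Gaussian, and the pairwise part as $V^2$ independent univariate Gaussians under the diagonal assumption. The array layout and the `log_lik` callable are illustrative assumptions, not part of the paper.

```python
import numpy as np

def estimate_ell_term(y_n, means, chols, pi, m_bin, s_bin_diag, log_lik, n_samples=100):
    """Monte Carlo estimate of sum_k pi_k <log p(y_n | f_n)>, cf. Equation (5).

    means: length-K list of (V, T_n) arrays, one T_n-dim mean per unary process
           (the block-diagonal structure of q_{k(n)}).
    chols: matching list of (V, T_n, T_n) Cholesky factors of each covariance block.
    pi:    (K,) mixture weights.
    m_bin, s_bin_diag: mean and diagonal variance of q(f_bin), both of length V**2.
    log_lik: callable evaluating log p(y_n | f_un, f_bin) for one joint sample.
    """
    total = 0.0
    for k, pi_k in enumerate(pi):
        acc = 0.0
        for _ in range(n_samples):
            # Each of the V blocks is sampled as an independent T_n-dim Gaussian.
            eps = np.random.randn(*means[k].shape)
            f_un = means[k] + np.einsum('vij,vj->vi', chols[k], eps)
            # Diagonal q(f_bin): V**2 independent univariate Gaussians.
            f_bin = m_bin + np.sqrt(s_bin_diag) * np.random.randn(m_bin.size)
            acc += log_lik(y_n, f_un, f_bin)
        total += pi_k * acc / n_samples
    return total
```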
1.2 Gradients

Taking the gradients of the $k$-th term for the $n$-th sequence in $\mathcal{L}_{\text{ell}}$:
\begin{align}
\nabla_{\lambda_k^{\text{un}}} \mathcal{L}_{\text{ell}}^{(k,n)} &= \nabla_{\lambda_k^{\text{un}}} \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n) \right\rangle_{q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})} \tag{6} \\
&= \nabla_{\lambda_k^{\text{un}}} \int_{\mathbf{f}^{\text{bin}}} \int_{\mathbf{f}^{\text{un}}_n} q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}}) \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, d\mathbf{f}^{\text{un}}_n\, d\mathbf{f}^{\text{bin}} \tag{7} \\
&= \int_{\mathbf{f}^{\text{bin}}} \int_{\mathbf{f}^{\text{un}}_n} q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}}) \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, \nabla_{\lambda_k^{\text{un}}} \log q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, d\mathbf{f}^{\text{un}}_n\, d\mathbf{f}^{\text{bin}} \tag{8} \\
&= \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, \nabla_{\lambda_k^{\text{un}}} \log q_{k(n)}(\mathbf{f}^{\text{un}}_n) \right\rangle_{q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})}, \tag{9}
\end{align}
where we have used the fact that $\nabla_x f(x) = f(x) \nabla_x \log f(x)$ for any nonnegative function $f(x)$. Similarly, the gradients with respect to the parameters of the distribution over binary functions can be estimated using:
\begin{equation}
\nabla_{\lambda^{\text{bin}}} \mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, \nabla_{\lambda^{\text{bin}}} \log q(\mathbf{f}^{\text{bin}}) \right\rangle_{q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})}. \tag{10}
\end{equation}

2 KL terms in the sparse model

The KL term ($\mathcal{L}_{\text{kl}}$) in the variational objective ($\mathcal{L}_{\text{elbo}}$) is composed of a KL divergence between the approximate posteriors and the priors over the inducing variables and pairwise functions:
\begin{equation}
\mathcal{L}_{\text{kl}} = \underbrace{- \text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u}))}_{\mathcal{L}_{\text{kl}}^{\text{un}}} \underbrace{- \text{KL}(q(\mathbf{f}^{\text{bin}}) \,\|\, p(\mathbf{f}^{\text{bin}}))}_{\mathcal{L}_{\text{kl}}^{\text{bin}}}, \tag{11}
\end{equation}
where, as the approximate posterior and the prior over the pairwise functions are Gaussian, the KL over pairwise functions can be computed analytically:
\begin{align}
\mathcal{L}_{\text{kl}}^{\text{bin}} = -\text{KL}(q(\mathbf{f}^{\text{bin}}) \,\|\, p(\mathbf{f}^{\text{bin}})) &= -\text{KL}(\mathcal{N}(\mathbf{f}^{\text{bin}}; \mathbf{m}^{\text{bin}}, S^{\text{bin}}) \,\|\, \mathcal{N}(\mathbf{f}^{\text{bin}}; \mathbf{0}, K^{\text{bin}})) \tag{12} \\
&= -\frac{1}{2} \left( \log |K^{\text{bin}}| - \log |S^{\text{bin}}| + (\mathbf{m}^{\text{bin}})^T (K^{\text{bin}})^{-1} \mathbf{m}^{\text{bin}} + \text{tr}\left( (K^{\text{bin}})^{-1} S^{\text{bin}} \right) - V^2 \right). \tag{13}
\end{align}
For the distributions over the unary functions we need to compute a KL divergence between a mixture of Gaussians and a Gaussian. For this we consider the decomposition of the KL divergence as follows:
\begin{equation}
\mathcal{L}_{\text{kl}}^{\text{un}} = -\text{KL}(q(\mathbf{u}) \,\|\, p(\mathbf{u})) = \underbrace{\mathbb{E}_q[-\log q(\mathbf{u})]}_{\mathcal{L}_{\text{ent}}} + \underbrace{\mathbb{E}_q[\log p(\mathbf{u})]}_{\mathcal{L}_{\text{cross}}}, \tag{14}
\end{equation}
where the entropy term ($\mathcal{L}_{\text{ent}}$) can be lower bounded using Jensen's inequality:
\begin{equation}
\mathcal{L}_{\text{ent}} \geq - \sum_{k=1}^{K} \pi_k \log \sum_{l=1}^{K} \pi_l\, \mathcal{N}(\mathbf{m}_k; \mathbf{m}_l, S_k + S_l) \triangleq \hat{\mathcal{L}}_{\text{ent}}, \tag{15}
\end{equation}
and the negative cross-entropy term ($\mathcal{L}_{\text{cross}}$) can be computed exactly:
\begin{equation}
\mathcal{L}_{\text{cross}} = -\frac{1}{2} \sum_{k=1}^{K} \pi_k \sum_{j=1}^{V} \left[ M \log 2\pi + \log |\kappa(Z_j, Z_j)| + \mathbf{m}_{kj}^T \kappa(Z_j, Z_j)^{-1} \mathbf{m}_{kj} + \text{tr}\left( \kappa(Z_j, Z_j)^{-1} S_{kj} \right) \right]. \tag{16}
\end{equation}
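The two analytically tractable pieces above, Equation (13) and the bound in Equation (15), translate directly into code. Below is a short NumPy/SciPy sketch of both (our own rendering, with hypothetical argument conventions), using Cholesky factorizations for the determinants and solves:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def kl_term_binary(m_bin, S_bin, K_bin):
    """L_kl^bin = -KL( N(m_bin, S_bin) || N(0, K_bin) ), cf. Equation (13)."""
    D = m_bin.shape[0]  # D = V**2 for the pairwise functions
    cK = cho_factor(K_bin, lower=True)
    logdet_K = 2.0 * np.sum(np.log(np.diag(cK[0])))
    logdet_S = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(S_bin))))
    maha = m_bin @ cho_solve(cK, m_bin)        # m^T K^{-1} m
    trace = np.trace(cho_solve(cK, S_bin))     # tr(K^{-1} S)
    return -0.5 * (logdet_K - logdet_S + maha + trace - D)

def entropy_lower_bound(pi, means, covs):
    """Jensen bound of Equation (15):
    -sum_k pi_k log sum_l pi_l N(m_k; m_l, S_k + S_l)."""
    bound = 0.0
    for k in range(len(pi)):
        log_terms = [np.log(pi[l]) +
                     multivariate_normal.logpdf(means[k], means[l], covs[k] + covs[l])
                     for l in range(len(pi))]
        bound -= pi[k] * logsumexp(log_terms)  # inner sum is the log-mixture term
    return bound
```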
3 Proof of Theorem 3

To prove Theorem 3 we will express the expected log likelihood term in the same form as that given in Equation (5), showing that the resulting $q_{k(n)}(\mathbf{f}^{\text{un}}_n)$ is also a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance, having $V$ blocks, each of dimensions $T_n \times T_n$. We start by taking the given $\mathcal{L}_{\text{ell}}$, where the expectations are over the joint posterior $q(\mathbf{f}, \mathbf{u} \mid \boldsymbol{\lambda}) = p(\mathbf{f}^{\text{un}} \mid \mathbf{u})\, q(\mathbf{u})\, q(\mathbf{f}^{\text{bin}})$:
\begin{align}
\mathcal{L}_{\text{ell}} &= \sum_{n=1}^{N_{\text{seq}}} \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n) \right\rangle_{p(\mathbf{f}^{\text{un}} \mid \mathbf{u})\, q(\mathbf{u})\, q(\mathbf{f}^{\text{bin}})} \tag{17} \\
&= \sum_{n=1}^{N_{\text{seq}}} \int_{\mathbf{f}} \underbrace{\int_{\mathbf{u}} q(\mathbf{u})\, p(\mathbf{f}^{\text{un}} \mid \mathbf{u})\, d\mathbf{u}}_{q(\mathbf{f}^{\text{un}})}\; q(\mathbf{f}^{\text{bin}}) \log p(\mathbf{y} \mid \mathbf{f})\, d\mathbf{f}, \tag{18}
\end{align}
where our approximating distribution is:
\begin{align}
q(\mathbf{f}) &= q(\mathbf{f}^{\text{un}})\, q(\mathbf{f}^{\text{bin}}), \tag{19} \\
q(\mathbf{f}^{\text{un}}) &= \int_{\mathbf{u}} q(\mathbf{u})\, p(\mathbf{f}^{\text{un}} \mid \mathbf{u})\, d\mathbf{u}, \tag{20}
\end{align}
which can be computed analytically:
\begin{align}
q(\mathbf{f}^{\text{un}}) &= \sum_{k=1}^{K} \pi_k\, q_k(\mathbf{f}^{\text{un}}) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{V} \mathcal{N}(\mathbf{f}^{\text{un}}_j; \mathbf{b}_{kj}, \boldsymbol{\Sigma}_{kj}), \quad \text{with} \tag{21} \\
\mathbf{b}_{kj} &= A_j \mathbf{m}_{kj}, \tag{22} \\
\boldsymbol{\Sigma}_{kj} &= \tilde{K}_j + A_j S_{kj} A_j^T. \tag{23}
\end{align}
We note in Equation (21) that $q_k(\mathbf{f}^{\text{un}})$ has a block-diagonal structure, which implies that we have the same expression for $\mathcal{L}_{\text{ell}}$ as in Equation (5). Therefore, we obtain analogous estimates:
\begin{equation}
\mathcal{L}_{\text{ell}} = \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n) \right\rangle_{q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})}. \tag{24}
\end{equation}
Here, as before, $q_{k(n)}(\mathbf{f}^{\text{un}}_n)$ is a $(T_n \times V)$-dimensional Gaussian with block-diagonal covariance $\boldsymbol{\Sigma}_k^{(n)}$, each block of size $T_n \times T_n$. The main difference in this (sparse) case is that $\mathbf{b}_k^{(n)}$ and $\boldsymbol{\Sigma}_k^{(n)}$ are constrained by the expressions in Equations (22) and (23). Hence, the proof for the gradients follows the same derivation as in Section 1.2 above.
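For concreteness, here is a small NumPy sketch (ours) of the marginal posterior parameters in Equations (22)-(23) for a single unary process $j$ and mixture component $k$. The kernel matrices are assumed precomputed, and the jitter term is a standard numerical safeguard not mentioned in the text:

```python
import numpy as np

def sparse_marginal_params(K_xz, K_zz, K_xx, m_kj, S_kj, jitter=1e-6):
    """b = A m and Sigma = K_tilde + A S A^T, with A = K_xz K_zz^{-1}
    and K_tilde = K_xx - A K_zx, cf. Equations (22)-(23) and (36)-(37)."""
    M = K_zz.shape[0]
    L = np.linalg.cholesky(K_zz + jitter * np.eye(M))
    # A = K_xz K_zz^{-1}, via two triangular solves against the Cholesky factor.
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_xz.T)).T
    K_tilde = K_xx - A @ K_xz.T
    b = A @ m_kj                       # Equation (22)
    Sigma = K_tilde + A @ S_kj @ A.T   # Equation (23)
    return b, Sigma
```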
4 Gradients of $\mathcal{L}_{\text{elbo}}$ for the sparse model

Here we give the gradients of the variational objective with respect to the parameters of the variational distributions over the inducing variables, the pairwise functions, and the hyper-parameters.

4.1 Inducing variables

KL term. As the structured likelihood does not affect the KL divergence term, the gradients corresponding to this term are similar to those in the non-structured case (Dezfouli and Bonilla, 2015). Let $K_{zz}$ be the block-diagonal covariance with $V$ blocks $\kappa(Z_j, Z_j)$, $j = 1, \ldots, V$. Additionally, let us assume the following definitions:
\begin{align}
C_{kl} &= S_k + S_l, \tag{25} \\
\mathcal{N}_{kl} &= \mathcal{N}(\mathbf{m}_k; \mathbf{m}_l, C_{kl}), \tag{26} \\
z_k &= \sum_{l=1}^{K} \pi_l\, \mathcal{N}_{kl}. \tag{27}
\end{align}
The gradients of $\mathcal{L}_{\text{cross}}$ with respect to the posterior mean, the posterior covariance, and the mixture weight for component $k$ are:
\begin{align}
\nabla_{\mathbf{m}_k} \mathcal{L}_{\text{cross}} &= -\pi_k K_{zz}^{-1} \mathbf{m}_k, \tag{28} \\
\nabla_{S_k} \mathcal{L}_{\text{cross}} &= -\frac{1}{2} \pi_k K_{zz}^{-1}, \tag{29} \\
\nabla_{\pi_k} \mathcal{L}_{\text{cross}} &= -\frac{1}{2} \sum_{j=1}^{V} \left[ M \log 2\pi + \log |\kappa(Z_j, Z_j)| + \mathbf{m}_{kj}^T \kappa(Z_j, Z_j)^{-1} \mathbf{m}_{kj} + \text{tr}\left( \kappa(Z_j, Z_j)^{-1} S_{kj} \right) \right], \tag{30}
\end{align}
where we note that we compute $K_{zz}^{-1}$ by inverting the corresponding blocks $\kappa(Z_j, Z_j)$ independently. The gradients of the entropy term with respect to the variational parameters are:
\begin{align}
\nabla_{\mathbf{m}_k} \hat{\mathcal{L}}_{\text{ent}} &= \pi_k \sum_{l=1}^{K} \pi_l \left( \frac{\mathcal{N}_{kl}}{z_k} + \frac{\mathcal{N}_{kl}}{z_l} \right) C_{kl}^{-1} (\mathbf{m}_k - \mathbf{m}_l), \tag{31} \\
\nabla_{S_k} \hat{\mathcal{L}}_{\text{ent}} &= \frac{1}{2} \pi_k \sum_{l=1}^{K} \pi_l \left( \frac{\mathcal{N}_{kl}}{z_k} + \frac{\mathcal{N}_{kl}}{z_l} \right) \left[ C_{kl}^{-1} - C_{kl}^{-1} (\mathbf{m}_k - \mathbf{m}_l)(\mathbf{m}_k - \mathbf{m}_l)^T C_{kl}^{-1} \right], \tag{32} \\
\nabla_{\pi_k} \hat{\mathcal{L}}_{\text{ent}} &= -\log z_k - \sum_{l=1}^{K} \pi_l \frac{\mathcal{N}_{kl}}{z_l}. \notag
\end{align}

Expected log likelihood term. Retaking the gradients in the full model in Equation (9), we have that:
\begin{equation}
\nabla_{\lambda_k^{\text{un}}} \mathcal{L}_{\text{ell}}^{(k,n)} = \left\langle \log p(\mathbf{y}_n \mid \mathbf{f}_n)\, \nabla_{\lambda_k^{\text{un}}} \log q_{k(n)}(\mathbf{f}^{\text{un}}_n) \right\rangle_{q_{k(n)}(\mathbf{f}^{\text{un}}_n)\, q(\mathbf{f}^{\text{bin}})}, \tag{33}
\end{equation}
where the variational parameters $\lambda_k^{\text{un}}$ are the posterior means and covariances ($\{\mathbf{m}_{kj}\}$ and $\{S_{kj}\}$) of the inducing variables. As given in Equation (21), $q_k(\mathbf{f}^{\text{un}})$ factorizes over the latent processes ($j = 1, \ldots, V$), and so do the marginals $q_{k(n)}(\mathbf{f}^{\text{un}}_n)$, hence:
\begin{equation}
\nabla_{\lambda_k^{\text{un}}} \log q_{k(n)}(\mathbf{f}^{\text{un}}_n) = \nabla_{\lambda_k^{\text{un}}} \sum_{j=1}^{V} \log \mathcal{N}(\mathbf{f}^{\text{un}}_{nj}; \mathbf{b}_{kjn}, \boldsymbol{\Sigma}_{kjn}), \tag{34}
\end{equation}
where each of the distributions in Equation (34) is a $T_n$-dimensional Gaussian. Let us assume the following definitions:
\begin{align}
X_n &: \text{all feature vectors corresponding to sequence } n, \tag{35} \\
A_{jn} &= \kappa(X_n, Z_j)\, \kappa(Z_j, Z_j)^{-1}, \tag{36} \\
\tilde{K}^n_j &= \kappa_j(X_n, X_n) - A_{jn}\, \kappa(Z_j, X_n), \quad \text{therefore:} \tag{37} \\
\mathbf{b}_{kjn} &= A_{jn} \mathbf{m}_{kj}, \tag{38} \\
\boldsymbol{\Sigma}_{kjn} &= \tilde{K}^n_j + A_{jn} S_{kj} A_{jn}^T. \tag{39}
\end{align}
Hence, the gradients of $\log q_{k(n)}(\mathbf{f}^{\text{un}}_n)$ with respect to the variational parameters of the unary posterior distributions over the inducing points are:
\begin{align}
\nabla_{\mathbf{m}_{kj}} \log q_{k(n)}(\mathbf{f}^{\text{un}}_n) &= A_{jn}^T \boldsymbol{\Sigma}_{kjn}^{-1} \left( \mathbf{f}^{\text{un}}_{nj} - \mathbf{b}_{kjn} \right), \tag{40} \\
\nabla_{S_{kj}} \log q_{k(n)}(\mathbf{f}^{\text{un}}_n) &= \frac{1}{2} A_{jn}^T \left[ \boldsymbol{\Sigma}_{kjn}^{-1} (\mathbf{f}^{\text{un}}_{nj} - \mathbf{b}_{kjn})(\mathbf{f}^{\text{un}}_{nj} - \mathbf{b}_{kjn})^T \boldsymbol{\Sigma}_{kjn}^{-1} - \boldsymbol{\Sigma}_{kjn}^{-1} \right] A_{jn}. \tag{41}
\end{align}
Therefore, the gradients of $\mathcal{L}_{\text{ell}}$ with respect to the parameters of the distributions over unary functions are:
\begin{align}
\nabla_{\mathbf{m}_{kj}} \mathcal{L}_{\text{ell}} &= \frac{\pi_k}{S} \sum_{n=1}^{N_{\text{seq}}} \kappa(Z_j, Z_j)^{-1} \kappa(Z_j, X_n)\, \boldsymbol{\Sigma}_{kjn}^{-1} \sum_{i=1}^{S} (\mathbf{f}^{\text{un}}_{nij} - \mathbf{b}_{kjn}) \log p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i), \tag{42} \\
\nabla_{S_{kj}} \mathcal{L}_{\text{ell}} &= \frac{\pi_k}{2S} \sum_{n=1}^{N_{\text{seq}}} A_{jn}^T \left\{ \sum_{i=1}^{S} \left[ \boldsymbol{\Sigma}_{kjn}^{-1} (\mathbf{f}^{\text{un}}_{nij} - \mathbf{b}_{kjn}) \left( (\mathbf{f}^{\text{un}}_{nj})^{(k,i)} - \mathbf{b}_{kjn} \right)^T \boldsymbol{\Sigma}_{kjn}^{-1} - \boldsymbol{\Sigma}_{kjn}^{-1} \right] \log p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i) \right\} A_{jn}. \tag{43}
\end{align}

4.2 Pairwise functions

The gradients of $\mathcal{L}_{\text{kl}}^{\text{bin}}$ with respect to the parameters of the posterior over pairwise functions are given by:
\begin{align}
\nabla_{\mathbf{m}^{\text{bin}}} \mathcal{L}_{\text{kl}}^{\text{bin}} &= -(K^{\text{bin}})^{-1} \mathbf{m}^{\text{bin}}, \tag{44} \\
\nabla_{S^{\text{bin}}} \mathcal{L}_{\text{kl}}^{\text{bin}} &= \frac{1}{2} \left( (S^{\text{bin}})^{-1} - (K^{\text{bin}})^{-1} \right). \tag{45}
\end{align}
The gradients of $\mathcal{L}_{\text{ell}}$ with respect to the parameters of the posterior over pairwise functions are given by:
\begin{align}
\nabla_{\mathbf{m}^{\text{bin}}} \mathcal{L}_{\text{ell}} &= \frac{1}{S} \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \sum_{i=1}^{S} (S^{\text{bin}})^{-1} (\mathbf{f}^{\text{bin}}_i - \mathbf{m}^{\text{bin}}) \log p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i), \tag{46} \\
\nabla_{S^{\text{bin}}} \mathcal{L}_{\text{ell}} &= \frac{1}{2S} \sum_{n=1}^{N_{\text{seq}}} \sum_{k=1}^{K} \pi_k \sum_{i=1}^{S} \left[ (S^{\text{bin}})^{-1} (\mathbf{f}^{\text{bin}}_i - \mathbf{m}^{\text{bin}})(\mathbf{f}^{\text{bin}}_i - \mathbf{m}^{\text{bin}})^T (S^{\text{bin}})^{-1} - (S^{\text{bin}})^{-1} \right] \log p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i). \notag
\end{align}

5 Piecewise pseudolikelihood

Piecewise pseudolikelihood (Sutton and McCallum, 2007) approximates the likelihood $p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i)$ of sequence $n$ given the latent functions $\mathbf{f}^{\text{un}}_{ni}$ and $\mathbf{f}^{\text{bin}}_i$ by computing the product,¹ for every single factor and every variable occurring in it, of the conditional probability of the variable given its neighbours with respect to that factor. In our linear-chain model, this yields the following expression for the log pseudolikelihood $\log p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i)$:
\begin{equation}
\log p(\mathbf{y}_n \mid \mathbf{f}^{\text{un}}_{ni}, \mathbf{f}^{\text{bin}}_i) = \sum_{w=1}^{W_n} \log p((\mathbf{y}_n)_w \mid \mathbf{f}^{\text{un}}_{ni}) + \sum_{w=1}^{W_n - 1} \log p((\mathbf{y}_n)_{w+1} \mid (\mathbf{y}_n)_w, \mathbf{f}^{\text{bin}}_i) + \sum_{w=1}^{W_n - 1} \log p((\mathbf{y}_n)_w \mid (\mathbf{y}_n)_{w+1}, \mathbf{f}^{\text{bin}}_i), \tag{47}
\end{equation}
where $W_n$ is the number of words in sentence $n$ and
\begin{align}
p((\mathbf{y}_n)_w \mid \mathbf{f}^{\text{un}}_{ni}) &\propto \exp\left( (f^{\text{un}}_{ni})_{\mathbf{y}_n(w)} \right); \tag{48} \\
p((\mathbf{y}_n)_w \mid (\mathbf{y}_n)_{w+1}, \mathbf{f}^{\text{bin}}_i) &\propto \exp\left( f^{\text{bin}}_i((\mathbf{y}_n)_w, (\mathbf{y}_n)_{w+1}) \right); \tag{49} \\
p((\mathbf{y}_n)_{w+1} \mid (\mathbf{y}_n)_w, \mathbf{f}^{\text{bin}}_i) &\propto \exp\left( f^{\text{bin}}_i((\mathbf{y}_n)_w, (\mathbf{y}_n)_{w+1}) \right). \tag{50}
\end{align}

¹ Here $i$ represents the index of the specific samples $\mathbf{f}^{\text{un}}_{ni}$ and $\mathbf{f}^{\text{bin}}_i$ taken from our distributions $q_{k(n)}(\mathbf{f}^{\text{un}}_n)$ and $q(\mathbf{f}^{\text{bin}})$.
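Equations (47)-(50) are cheap to evaluate; the following Python sketch (ours, with an illustrative array layout) computes the log piecewise pseudolikelihood of one sentence, normalizing each conditional as the proportionality signs require:

```python
import numpy as np
from scipy.special import logsumexp

def log_piecewise_pl(y, f_un, f_bin):
    """Log piecewise pseudolikelihood of one label sequence, cf. Equations (47)-(50).

    y:     (W,) integer labels in {0, ..., V-1} for the W words of the sentence.
    f_un:  (W, V) unary function values, one row per word.
    f_bin: (V, V) pairwise function values f_bin(label, next_label).
    """
    total = 0.0
    for w in range(len(y)):
        # Unary factor: p(y_w | f_un) ∝ exp(f_un[w, y_w]), Equation (48).
        total += f_un[w, y[w]] - logsumexp(f_un[w])
    for w in range(len(y) - 1):
        # p(y_{w+1} | y_w, f_bin): normalize over the next label, Equation (50).
        total += f_bin[y[w], y[w + 1]] - logsumexp(f_bin[y[w], :])
        # p(y_w | y_{w+1}, f_bin): normalize over the current label, Equation (49).
        total += f_bin[y[w], y[w + 1]] - logsumexp(f_bin[:, y[w + 1]])
    return total
```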
6 Experiments

6.1 Experimental set-up

Before starting all experiments, our program selected the positions of the 500 inducing points by performing K-means clustering on the training data. The initial values of the means (one mean value for every inducing point p and for every possible label l) are set as the fraction of the points of the training set in the cluster whose centroid is p that belong to label l.

As described in Algorithm 1, we adaptively choose the correct learning rates for all parameters by searching (beginning from an initial guess, and doubling or dividing it by two as needed) for the biggest step size that causes the objective to decrease over twenty stochastic optimization steps. The step size to use for the given parameter is taken as one fourth of this value, to avoid instability. The initial step sizes were 0.5 for unary mean parameters, 2 for binary mean parameters, 2 for unary covariance parameters, 2 for binary covariance parameters and 1.0 for (linear) kernel hyperparameters (these values were selected only to speed up, insofar as possible, the process of parameter search); and the optimal step sizes were found in the order unary mean → unary covariance → binary mean → binary covariance → kernel hyperparameter. Of course this is not an exhaustive grid search; but it is less computationally expensive and works well in practice.

6.2 Optimization

Then we optimize the three sets of parameters (unaries, binaries, hyperparameters) in a global loop, in the same order as mentioned earlier, through standard Stochastic Gradient Descent, until the time limit is reached, using 4000 new random samples each step for the estimation of the relevant gradients and averages.² For the small-scale experiments, the variational parameters for unary nodes are optimized for 500 iterations, variational parameters for pairwise nodes are optimized for 100 iterations, and hyper-parameters are updated for 20 iterations.

We keep a tight bound (10 in the unary case, 20 in the binary one) on the maximum possible absolute values that covariance parameters can take. If an update would bring their value beyond it, we recompute the gradients with new samples: oftentimes, this resolves the issue (in brief, because the faulty update was due to a bad estimation of the gradients). If the problem persists, we disregard the current sentence and move on to the next one; and if after ten attempts the problem still persists, we move back to the latest safe position that did not cause out-of-bound errors. In this way, we can use a relatively small number of samples and maintain rather aggressive step sizes while recovering neatly from out-of-bounds errors.

The setting for the large-scale experiments was similar, except that the optimization schedule was (1500, 500, 500) (that is, 1500 unary optimization steps, 500 binary ones and 500 hyperparameter ones) for the big base np and chunking experiments. All the experiments were run for four hours, not counting initial clustering or final prediction (but counting step size selection). For comparison, the (small-scale) gp-ess experiments, which were run for 50,000 elliptical slice sampling steps, took on average 2 hours for chunking, 1.19 hours for base np, 2.07 hours for segmentation and 2 for japanese ne.

6.3 Performance profiles

Figure 1 shows the performance of our algorithm as a function of time. We see that the test likelihood decreases very regularly in all the folds, and so, overall, does the error rate, albeit with more variability.
The bulk of the optimization, both with respect to the test likelihood and with respect to the error rate, occurs during the first 10 minutes. This suggests that the kind of approach described in this paper might be particularly suited for cases in which the expected log likelihood of the prediction and the speed of convergence are priorities.

² When computing the gradient of the kernel hyperparameter, 8000 samples were used instead, to ensure greater stability.
Algorithm 1 Method for computing the step size
 1: function check_stepsize(parameters, step_size, num_to_check = 20)
 2:     to_check ← num_to_check sentences at random from the training set;
 3:     old_obj ← current value of the objective function;
 4:     for i = 0 ... num_to_check do
 5:         grad ← gradient of objective wrt parameters;
 6:         parameters ← parameters − step_size · grad;
 7:         if parameters are out of bounds then
 8:             return False;
 9:     new_obj ← current value of the objective function;
10:     if new_obj < old_obj then
11:         return True;
12:     else
13:         return False;
14: function search_stepsize_up(parameters, initial_step_size, factor)
15:     step_size ← initial_step_size;
16:     old_params ← current parameter values of the model;
17:     while True do
18:         is_good ← check_stepsize(parameters, step_size);
19:         if is_good = False then
20:             parameters ← old_params;
21:             return step_size / factor;
22:         step_size ← step_size · factor;
23: function search_stepsize_down(parameters, initial_step_size, factor)
24:     step_size ← initial_step_size;
25:     old_params ← current parameter values of the model;
26:     while True do
27:         is_good ← check_stepsize(parameters, step_size);
28:         if is_good = True then
29:             parameters ← old_params;
30:             return step_size · factor;
31:         step_size ← step_size / factor;
32: function choose_stepsize(parameters, initial_step_size, factor)
33:     step_size ← search_stepsize_up(parameters, initial_step_size, factor);
34:     if step_size = initial_step_size / factor then
35:         step_size ← search_stepsize_down(parameters, initial_step_size, factor);
36:     return step_size / factor;
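For readers who prefer an executable form, here is a compact Python rendering of the logic of Algorithm 1. It is a sketch under our own simplification: the bookkeeping of trial updates and parameter restoration is folded into the `check` callable, which is assumed to run twenty stochastic steps at the given size, restore the model, and report whether the objective decreased without going out of bounds.

```python
def choose_stepsize(check, initial_step_size, factor=2.0):
    """Find a safe step size by doubling or halving, as in Algorithm 1.

    check(step_size) -> bool is assumed to trial the step size on a few random
    sentences and report success, restoring the parameters afterwards.
    """
    step_size = initial_step_size
    if check(step_size):
        # Double until the objective stops decreasing; keep the last good size.
        while check(step_size * factor):
            step_size *= factor
    else:
        # Halve until the objective decreases again.
        while not check(step_size):
            step_size /= factor
    # Back off from the largest working step size, to avoid instability.
    return step_size / factor
```

With `factor = 2.0` this reproduces the doubling/halving search described in Section 6.1.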
Figure 1: The test performance of gp-var-t on chunking for the large scale experiment as a function of time.

References

Amir Dezfouli and Edwin V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In NIPS, 2015.

Charles Sutton and Andrew McCallum. Piecewise pseudolikelihood for efficient training of conditional random fields. In ICML, 2007.