# Unsupervised Learning

Size: px
Start display at page: ## Transcription

1 Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College London Term 1, Autumn 5

2 Model complexity and overfitting: a simple example M = M = 1 M = 2 M = M = 4 M = 5 M = 6 M =

3 Learning Model Structure How many clusters in the data? What is the intrinsic dimensionality of the data? Is this input relevant to predicting that output? What is the order of a dynamical system? How many states in a hidden Markov model? SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY How many auditory sources in the input?

4 Using Occam s Razor to Learn Model Structure Compare model classes m using their posterior probability given the data: P (y m)p (m) P (m y) =, P (y m) = P (y θ m, m)p (θ m m) dθ m P (y) Θ m Interpretation of P (y m): The probability that randomly selected parameter values from the model class would generate data set y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random. too simple P(Y M i ) "just right" too complex Y All possible data sets

5 Bayesian Model Comparison: Terminology A model class m is a set of models parameterised by θ m, e.g. the set of all possible mixtures of m Gaussians. The marginal likelihood of model class m: P (y m) = P (y θ m, m)p (θ m m) dθ m Θ m is also known as the Bayesian evidence for model m. The ratio of two marginal likelihoods is known as the Bayes factor: P (y m) P (y m ) The Occam s Razor principle is, roughly speaking, that one should prefer simpler explanations than more complex explanations. Bayesian inference formalises and automatically implements the Occam s Razor principle.

6 Bayesian Model Comparison: Occam s Razor at Work M = M = 1 M = 2 M = Model Evidence M = M = M = M = 7 P(Y M) M e.g. for quadratic (M=2): y = a +a 1 x+a 2 x 2 +ɛ, where ɛ N (, τ) and θ 2 = [a a 1 a 2 τ] demo: polybayes

7 Practical Bayesian approaches Laplace approximations: Makes a Gaussian approximation about the maximum a posteriori parameter estimate. Bayesian Information Criterion (BIC) an asymptotic approximation. Markov chain Monte Carlo methods (MCMC): In the limit are guaranteed to converge, but: Many samples required to ensure accuracy. Sometimes hard to assess convergence. Variational approximations Note: other deterministic approximations have been developed more recently and can be applied in this context: e.g. Bethe approximations and Expectation Propagation

8 Laplace Approximation data set: y models: m = 1..., M parameter sets: θ 1..., θ M Model Comparison: P (m y) P (m)p (y m) For large amounts of data (relative to number of parameters, d) the parameter posterior is approximately Gaussian around the MAP estimate ˆθ m : P (θ m y, m) (2π) d 1 2 A 2 exp { 1 } 2 (θ m ˆθ m ) A(θ m ˆθ m ) P (y m) = P (θ m, y m) P (θ m y, m) Evaluating the above for ln P (y m) at ˆθ m we get the Laplace approximation: ln P (y m) ln P (ˆθ m m) + ln P (y ˆθ m, m) + d 2 ln 2π 1 2 ln A A is the d d Hessian matrix of log P (θ m y, m): A kl = 2 θ mk θ ln P (θ m y,. ml m) ˆθm Can also be derived from 2 nd order Taylor expansion of log posterior. The Laplace approximation can be used for model comparison.

9 Bayesian Information Criterion (BIC) BIC can be obtained from the Laplace approximation: ln P (y m) ln P (ˆθ m m) + ln P (y ˆθ m, m) + d 2 ln 2π 1 2 ln A in the large sample limit (N ) where N is the number of data points, A grows as NA for some fixed matrix A, so ln A ln NA = ln(n d A ) = d ln N + ln A. Retaining only terms that grow in N we get: ln P (y m) ln P (y ˆθ m, m) d 2 ln N Properties: Quick and easy to compute It does not depend on the prior We can use the ML estimate of θ instead of the MAP estimate It is equivalent to the Minimum Description Length (MDL) criterion It assumes that in the large sample limit, all the parameters are well-determined (i.e. the model is identifiable; otherwise, d should be the number of well-determined parameters) Danger: counting parameters can be deceiving! (c.f. sinusoid, infinite models)

10 Sampling Approximations Let s consider a non-markov chain method, Importance Sampling: ln P (y m) = ln P (y θ m, m)p (θ m m) dθ m Θ m = ln P (y θ m, m) P (θ m m) Θ m Q(θ m ) Q(θ m) dθ m ln 1 K P (y θ (k) m, m) P (θ(k) m m) k Q(θ (k) m ) where θ (k) m are i.i.d. draws from Q(θ m ). Assumes we can sample from and evaluate Q(θ m ) (incl. normalization!) and we can compute the likelihood P (y θ (k) m, m). Although importance sampling does not work well in high dimensions, it inspires the following approach: Create a Markov chain, Q k Q k+1... for which: Q k (θ) can be evaluated including normalization lim k Q k (θ) = P (θ y, m)

11 Variational Bayesian Learning Lower Bounding the Marginal Likelihood Let the hidden latent variables be x, data y and the parameters θ. Lower bound the marginal likelihood (Bayesian model evidence) using Jensen s inequality: ln P (y) = ln = ln dx dθ P (y, x, θ) P (y, x, θ) dx dθ Q(x, θ) Q(x, θ) dx dθ Q(x, θ) ln P (y, x, θ) Q(x, θ). Use a simpler, factorised approximation to Q(x, θ): ln P (y) dx dθ Q x (x)q θ (θ) ln Maximize this lower bound. = F(Q x (x), Q θ (θ), y). P (y, x, θ) Q x (x)q θ (θ) m

12 Variational Bayesian Learning... Maximizing this lower bound, F, leads to EM-like updates: Q x(x) exp ln P (x,y θ) Qθ (θ) E like step Q θ(θ) P (θ) exp ln P (x,y θ) Qx (x) M like step Maximizing F is equivalent to minimizing KL-divergence between the approximate posterior, Q(θ)Q(x) and the true posterior, P (θ, x y). ln P (y) ln P (y) F(Q x (x), Q θ (θ), y) = dx dθ Q x (x)q θ (θ) ln P (y, x, θ) Q x (x)q θ (θ) dx dθ Q x (x)q θ (θ) ln Q x(x)q θ (θ) P (x, θ y) = = KL(Q P )

13 Conjugate-Exponential models Let s focus on conjugate-exponential (CE) models, which satisfy (1) and (2): Condition (1). The joint probability over variables is in the exponential family: P (x, y θ) = f(x, y) g(θ) exp { φ(θ) u(x, y) } where φ(θ) is the vector of natural parameters, u are sufficient statistics Condition (2). The prior over parameters is conjugate to this joint probability: P (θ η, ν) = h(η, ν) g(θ) η exp { φ(θ) ν } where η and ν are hyperparameters of the prior. Conjugate priors are computationally convenient and have an intuitive interpretation: η: number of pseudo-observations ν: values of pseudo-observations

14 Conjugate-Exponential examples In the CE family: Gaussian mixtures factor analysis, probabilistic PCA hidden Markov models and factorial HMMs linear dynamical systems and switching models discrete-variable belief networks Other as yet undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others. Not in the CE family: Boltzmann machines, MRFs (no simple conjugacy) logistic regression (no simple conjugacy) sigmoid belief networks (not exponential) independent components analysis (not exponential) Note: one can often approximate these models with models in the CE family.

15 A Useful Result Given an iid data set y = (y 1,... y n ), if the model is CE then: (a) Q θ (θ) is also conjugate, i.e. Q θ (θ) = h( η, ν)g(θ) η exp { φ(θ) ν } where η = η + n and ν = ν + i u(x i, y i ). (b) Q x (x) = n i=1 Q x i (x i ) is of the same form as in the E step of regular EM, but using pseudo parameters computed by averaging over Q θ (θ) Q xi (x i ) f(x i, y i ) exp { φ(θ) u(x i, y i ) } = P (x i y i, φ(θ)) KEY points: (a) the approximate parameter posterior is of the same form as the prior, so it is easily summarized in terms of two sets of hyperparameters, η and ν; (b) the approximate hidden variable posterior, averaging over all parameters, is of the same form as the hidden variable posterior for a single setting of the parameters, so again, it is easily computed using the usual methods.

16 The Variational Bayesian EM algorithm EM for MAP estimation Goal: maximize p(θ y, m) w.r.t. θ E Step: compute Variational Bayesian EM Goal: lower bound p(y m) VB-E Step: compute q (t+1) x (x) = p(x y, θ (t) ) q (t+1) x (x) = p(x y, φ (t) ) M Step: θ (t+1) =argmax θ Z q (t+1) x (x) ln p(x, y, θ) dx VB-M Step:»Z q (t+1) θ (θ) = exp q (t+1) x (x) ln p(x, y, θ) dx Properties: Reduces to the EM algorithm if q θ (θ) = δ(θ θ ). F m increases monotonically, and incorporates the model complexity penalty. Analytical parameter distributions (but not constrained to be Gaussian). VB-E step has same complexity as corresponding E step. We can use the junction tree, belief propagation, Kalman filter, etc, algorithms in the VB-E step of VB-EM, but using expected natural parameters, φ.

17 Variational Bayes: History of Models Treated multilayer perceptrons (Hinton & van Camp, 1993) mixture of experts (Waterhouse, MacKay & Robinson, 1996) hidden Markov models (MacKay, 1995) other work by Jaakkola, Jordan, Barber, Bishop, Tipping, etc Examples of Variational Learning of Model Structure mixtures of factor analysers (Ghahramani & Beal, 1999) mixtures of Gaussians (Attias, 1999) independent components analysis (Attias, 1999; Miskin & MacKay, ; Valpola ) principal components analysis (Bishop, 1999) linear dynamical systems (Ghahramani & Beal, ) mixture of experts (Ueda & Ghahramani, ) discrete graphical models (Beal & Ghahramani, 2) VIBES software for conjugate-exponential graphs (Winn, 3)

18 Mixture of Factor Analysers Goal: Infer number of clusters Infer intrinsic dimensionality of each cluster Under the assumption that each cluster is Gaussian embed demo

19 Mixture of Factor Analysers True data: 6 Gaussian clusters with dimensions: ( ) embedded in 1-D Inferred structure: number of points intrinsic dimensionalities per cluster Finds the clusters and dimensionalities efficiently. The model complexity reduces in line with the lack of data support. demos: run simple and ueda demo

20 Hidden Markov Models S 1 S 2 S 3 S T Y 1 Y 2 Y 3 Y T Discrete hidden states, s t. Observations y t. How many hidden states? What structure state-transition matrix? demo: vbhmm demo

21 Hidden Markov Models: Discriminating Forward from Reverse English First 8 sentences from Alice in Wonderland. Compare VB-HMM with ML-HMM. 1 Discriminating Forward and Reverse English 1 Model Selection using F 11 Classification Accuracy.8.6 F ML VB1 VB Hidden States Hidden States

22 Linear Dynamical Systems X 1 X 2 X 3 X T Y 1 Y 2 Y 3 Y T Assumes y t generated from a sequence of Markov hidden state variables x t If transition and output functions are linear, time-invariant, and noise distributions are Gaussian, this is a linear-gaussian state-space model: x t = Ax t 1 + w t, y t = Cx t + v t Three levels of inference: I Given data, structure and parameters, Kalman smoothing hidden state; II Given data and structure, EM hidden state and parameter point estimates; III Given data only, VEM model structure and distributions over parameters and hidden state.

23 Linear Dynamical System Results Inferring model structure: a) SSM(,3) i.e. FA b) SSM(3,3) c) SSM(3,4) a b c X X Y X X Y X X Y t 1 t t t 1 t t t 1 t t Inferred model complexity reduces with less data: True model: SSM(6,6) 1-dim observation vector. d 4 e 35 f 25 g 1 h 3 i j 1 X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t demo: bayeslds

24 Independent Components Analysis Blind Source Separation: 5 1 msec speech and music sources linearly mixed to produce 11 signals (microphones) from Attias ()

25 Summary & Conclusions Bayesian learning avoids overfitting and can be used to do model comparison / selection. But we need approximations: Laplace BIC Sampling Variational

26 Other topics I would have liked to cover in the course but did not have time to... Nonparametric Bayesian learning, infinite models and Dirichlet Processes Bayes net structure learning Nonlinear dimensionality reduction More on loopy belief propagation, Bethe free energies, and Expectation Propagation Exact sampling Particle filters Semi-supervised learning Gaussian processes and SVMs (supervised!) Reinforcement Learning

### Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Bayesian Model Selection Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon

### Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

### CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection

### Variational inference in the conjugate-exponential family Variational inference in the conjugate-exponential family Matthew J. Beal Work with Zoubin Ghahramani Gatsby Computational Neuroscience Unit August 2000 Abstract Variational inference in the conjugate-exponential

### The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures BAYESIAN STATISTICS 7, pp. 000 000 J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) c Oxford University Press, 2003 The Variational Bayesian EM

### Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

### Bayesian Hidden Markov Models and Extensions Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling

### Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

### Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

### Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

### Probabilistic Modelling and Bayesian Inference Probabilistic Modelling and Bayesian Inference Zoubin Ghahramani University of Cambridge, UK Uber AI Labs, San Francisco, USA zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ MLSS Tübingen Lectures

### Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

### 13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

### Probabilistic Modelling and Bayesian Inference Probabilistic Modelling and Bayesian Inference Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ MLSS Tübingen Lectures

### STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

### An introduction to Variational calculus in Machine Learning n introduction to Variational calculus in Machine Learning nders Meng February 2004 1 Introduction The intention of this note is not to give a full understanding of calculus of variations since this area

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

### p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

### Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

### Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

### Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

### Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

### U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models Jaemo Sung 1, Sung-Yang Bang 1, Seungjin Choi 1, and Zoubin Ghahramani 2 1 Department of Computer Science, POSTECH,

### Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

### Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

### STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

### Integrated Non-Factorized Variational Inference Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014

### an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

### Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

### Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 20: Expectation Maximization Algorithm EM for Mixture Models Many figures courtesy Kevin Murphy s

### Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

### Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

### Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

### Variational Message Passing. By John Winn, Christopher M. Bishop Presented by Andy Miller Variational Message Passing By John Winn, Christopher M. Bishop Presented by Andy Miller Overview Background Variational Inference Conjugate-Exponential Models Variational Message Passing Messages Univariate

### Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

### Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

### Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

### 9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

### Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

### ECE521 Tutorial 11. Topic Review. ECE521 Winter Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides. ECE521 Tutorial 11 / 4 ECE52 Tutorial Topic Review ECE52 Winter 206 Credits to Alireza Makhzani, Alex Schwing, Rich Zemel and TAs for slides ECE52 Tutorial ECE52 Winter 206 Credits to Alireza / 4 Outline K-means, PCA 2 Bayesian

### Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

### G8325: Variational Bayes G8325: Variational Bayes Vincent Dorie Columbia University Wednesday, November 2nd, 2011 bridge Variational University Bayes Press 2003. On-screen viewing permitted. Printing not permitted. http://www.c

### Nonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian

### Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

### CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

### Introduction to Probabilistic Modelling and Machine Learning Introduction to Probabilistic Modelling and Machine Learning Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Wroc law

### Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

### Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

### Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

### Graphical models: parameter learning Graphical models: parameter learning Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London London WC1N 3AR, England http://www.gatsby.ucl.ac.uk/ zoubin/ zoubin@gatsby.ucl.ac.uk

### Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

### PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

### Probabilistic and Bayesian Machine Learning Probabilistic and Bayesian Machine Learning Lecture 1: Introduction to Probabilistic Modelling Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Why a

### Expectation Propagation for Approximate Bayesian Inference Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given

### Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

### Lecture 1a: Basic Concepts and Recaps Lecture 1a: Basic Concepts and Recaps Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced

### Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

### CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 10 Alternatives to Monte Carlo Computation Since about 1990, Markov chain Monte Carlo has been the dominant

### Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

### Lecture 1b: Linear Models for Regression Lecture 1b: Linear Models for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced

### VIBES: A Variational Inference Engine for Bayesian Networks VIBES: A Variational Inference Engine for Bayesian Networks Christopher M. Bishop Microsoft Research Cambridge, CB3 0FB, U.K. research.microsoft.com/ cmbishop David Spiegelhalter MRC Biostatistics Unit

### Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

### Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August

### Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

### Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

### The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

### Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

### Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

### Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

### Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

### Statistical learning. Chapter 20, Sections 1 4 1 Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

### A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

### Machine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014 Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,

### Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007 Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

### Statistical Learning Theory of Variational Bayes Statistical Learning Theory of Variational Bayes Department of Computational Intelligence and Systems Science Interdisciplinary Graduate School of Science and Engineering Tokyo Institute of Technology

### Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

### Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

### Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

### Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes

### Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

### PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College

### COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

### Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

### Sequential Monte Carlo Methods for Bayesian Computation Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter

### Variational Inference via Stochastic Backpropagation Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016 Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary Outline Preliminaries Stochastic Backpropagation

### Learning From Data Lecture 15 Reflecting on Our Path - Epilogue to Part I Learning From Data Lecture 15 Reflecting on Our Path - Epilogue to Part I What We Did The Machine Learning Zoo Moving Forward M Magdon-Ismail CSCI 4100/6100 recap: Three Learning Principles Scientist 2 Another Walkthrough of Variational Bayes Bevan Jones Machine Learning Reading Group Macquarie University 2 Variational Bayes? Bayes Bayes Theorem But the integral is intractable! Sampling Gibbs, Metropolis Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection