# Unsupervised Learning

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College London Term 1, Autumn 5

2 Model complexity and overfitting: a simple example M = M = 1 M = 2 M = M = 4 M = 5 M = 6 M =

3 Learning Model Structure How many clusters in the data? What is the intrinsic dimensionality of the data? Is this input relevant to predicting that output? What is the order of a dynamical system? How many states in a hidden Markov model? SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY How many auditory sources in the input?

4 Using Occam s Razor to Learn Model Structure Compare model classes m using their posterior probability given the data: P (y m)p (m) P (m y) =, P (y m) = P (y θ m, m)p (θ m m) dθ m P (y) Θ m Interpretation of P (y m): The probability that randomly selected parameter values from the model class would generate data set y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random. too simple P(Y M i ) "just right" too complex Y All possible data sets

5 Bayesian Model Comparison: Terminology A model class m is a set of models parameterised by θ m, e.g. the set of all possible mixtures of m Gaussians. The marginal likelihood of model class m: P (y m) = P (y θ m, m)p (θ m m) dθ m Θ m is also known as the Bayesian evidence for model m. The ratio of two marginal likelihoods is known as the Bayes factor: P (y m) P (y m ) The Occam s Razor principle is, roughly speaking, that one should prefer simpler explanations than more complex explanations. Bayesian inference formalises and automatically implements the Occam s Razor principle.

6 Bayesian Model Comparison: Occam s Razor at Work M = M = 1 M = 2 M = Model Evidence M = M = M = M = 7 P(Y M) M e.g. for quadratic (M=2): y = a +a 1 x+a 2 x 2 +ɛ, where ɛ N (, τ) and θ 2 = [a a 1 a 2 τ] demo: polybayes

7 Practical Bayesian approaches Laplace approximations: Makes a Gaussian approximation about the maximum a posteriori parameter estimate. Bayesian Information Criterion (BIC) an asymptotic approximation. Markov chain Monte Carlo methods (MCMC): In the limit are guaranteed to converge, but: Many samples required to ensure accuracy. Sometimes hard to assess convergence. Variational approximations Note: other deterministic approximations have been developed more recently and can be applied in this context: e.g. Bethe approximations and Expectation Propagation

8 Laplace Approximation data set: y models: m = 1..., M parameter sets: θ 1..., θ M Model Comparison: P (m y) P (m)p (y m) For large amounts of data (relative to number of parameters, d) the parameter posterior is approximately Gaussian around the MAP estimate ˆθ m : P (θ m y, m) (2π) d 1 2 A 2 exp { 1 } 2 (θ m ˆθ m ) A(θ m ˆθ m ) P (y m) = P (θ m, y m) P (θ m y, m) Evaluating the above for ln P (y m) at ˆθ m we get the Laplace approximation: ln P (y m) ln P (ˆθ m m) + ln P (y ˆθ m, m) + d 2 ln 2π 1 2 ln A A is the d d Hessian matrix of log P (θ m y, m): A kl = 2 θ mk θ ln P (θ m y,. ml m) ˆθm Can also be derived from 2 nd order Taylor expansion of log posterior. The Laplace approximation can be used for model comparison.

9 Bayesian Information Criterion (BIC) BIC can be obtained from the Laplace approximation: ln P (y m) ln P (ˆθ m m) + ln P (y ˆθ m, m) + d 2 ln 2π 1 2 ln A in the large sample limit (N ) where N is the number of data points, A grows as NA for some fixed matrix A, so ln A ln NA = ln(n d A ) = d ln N + ln A. Retaining only terms that grow in N we get: ln P (y m) ln P (y ˆθ m, m) d 2 ln N Properties: Quick and easy to compute It does not depend on the prior We can use the ML estimate of θ instead of the MAP estimate It is equivalent to the Minimum Description Length (MDL) criterion It assumes that in the large sample limit, all the parameters are well-determined (i.e. the model is identifiable; otherwise, d should be the number of well-determined parameters) Danger: counting parameters can be deceiving! (c.f. sinusoid, infinite models)

10 Sampling Approximations Let s consider a non-markov chain method, Importance Sampling: ln P (y m) = ln P (y θ m, m)p (θ m m) dθ m Θ m = ln P (y θ m, m) P (θ m m) Θ m Q(θ m ) Q(θ m) dθ m ln 1 K P (y θ (k) m, m) P (θ(k) m m) k Q(θ (k) m ) where θ (k) m are i.i.d. draws from Q(θ m ). Assumes we can sample from and evaluate Q(θ m ) (incl. normalization!) and we can compute the likelihood P (y θ (k) m, m). Although importance sampling does not work well in high dimensions, it inspires the following approach: Create a Markov chain, Q k Q k+1... for which: Q k (θ) can be evaluated including normalization lim k Q k (θ) = P (θ y, m)

11 Variational Bayesian Learning Lower Bounding the Marginal Likelihood Let the hidden latent variables be x, data y and the parameters θ. Lower bound the marginal likelihood (Bayesian model evidence) using Jensen s inequality: ln P (y) = ln = ln dx dθ P (y, x, θ) P (y, x, θ) dx dθ Q(x, θ) Q(x, θ) dx dθ Q(x, θ) ln P (y, x, θ) Q(x, θ). Use a simpler, factorised approximation to Q(x, θ): ln P (y) dx dθ Q x (x)q θ (θ) ln Maximize this lower bound. = F(Q x (x), Q θ (θ), y). P (y, x, θ) Q x (x)q θ (θ) m

12 Variational Bayesian Learning... Maximizing this lower bound, F, leads to EM-like updates: Q x(x) exp ln P (x,y θ) Qθ (θ) E like step Q θ(θ) P (θ) exp ln P (x,y θ) Qx (x) M like step Maximizing F is equivalent to minimizing KL-divergence between the approximate posterior, Q(θ)Q(x) and the true posterior, P (θ, x y). ln P (y) ln P (y) F(Q x (x), Q θ (θ), y) = dx dθ Q x (x)q θ (θ) ln P (y, x, θ) Q x (x)q θ (θ) dx dθ Q x (x)q θ (θ) ln Q x(x)q θ (θ) P (x, θ y) = = KL(Q P )

13 Conjugate-Exponential models Let s focus on conjugate-exponential (CE) models, which satisfy (1) and (2): Condition (1). The joint probability over variables is in the exponential family: P (x, y θ) = f(x, y) g(θ) exp { φ(θ) u(x, y) } where φ(θ) is the vector of natural parameters, u are sufficient statistics Condition (2). The prior over parameters is conjugate to this joint probability: P (θ η, ν) = h(η, ν) g(θ) η exp { φ(θ) ν } where η and ν are hyperparameters of the prior. Conjugate priors are computationally convenient and have an intuitive interpretation: η: number of pseudo-observations ν: values of pseudo-observations

14 Conjugate-Exponential examples In the CE family: Gaussian mixtures factor analysis, probabilistic PCA hidden Markov models and factorial HMMs linear dynamical systems and switching models discrete-variable belief networks Other as yet undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others. Not in the CE family: Boltzmann machines, MRFs (no simple conjugacy) logistic regression (no simple conjugacy) sigmoid belief networks (not exponential) independent components analysis (not exponential) Note: one can often approximate these models with models in the CE family.

15 A Useful Result Given an iid data set y = (y 1,... y n ), if the model is CE then: (a) Q θ (θ) is also conjugate, i.e. Q θ (θ) = h( η, ν)g(θ) η exp { φ(θ) ν } where η = η + n and ν = ν + i u(x i, y i ). (b) Q x (x) = n i=1 Q x i (x i ) is of the same form as in the E step of regular EM, but using pseudo parameters computed by averaging over Q θ (θ) Q xi (x i ) f(x i, y i ) exp { φ(θ) u(x i, y i ) } = P (x i y i, φ(θ)) KEY points: (a) the approximate parameter posterior is of the same form as the prior, so it is easily summarized in terms of two sets of hyperparameters, η and ν; (b) the approximate hidden variable posterior, averaging over all parameters, is of the same form as the hidden variable posterior for a single setting of the parameters, so again, it is easily computed using the usual methods.

16 The Variational Bayesian EM algorithm EM for MAP estimation Goal: maximize p(θ y, m) w.r.t. θ E Step: compute Variational Bayesian EM Goal: lower bound p(y m) VB-E Step: compute q (t+1) x (x) = p(x y, θ (t) ) q (t+1) x (x) = p(x y, φ (t) ) M Step: θ (t+1) =argmax θ Z q (t+1) x (x) ln p(x, y, θ) dx VB-M Step:»Z q (t+1) θ (θ) = exp q (t+1) x (x) ln p(x, y, θ) dx Properties: Reduces to the EM algorithm if q θ (θ) = δ(θ θ ). F m increases monotonically, and incorporates the model complexity penalty. Analytical parameter distributions (but not constrained to be Gaussian). VB-E step has same complexity as corresponding E step. We can use the junction tree, belief propagation, Kalman filter, etc, algorithms in the VB-E step of VB-EM, but using expected natural parameters, φ.

17 Variational Bayes: History of Models Treated multilayer perceptrons (Hinton & van Camp, 1993) mixture of experts (Waterhouse, MacKay & Robinson, 1996) hidden Markov models (MacKay, 1995) other work by Jaakkola, Jordan, Barber, Bishop, Tipping, etc Examples of Variational Learning of Model Structure mixtures of factor analysers (Ghahramani & Beal, 1999) mixtures of Gaussians (Attias, 1999) independent components analysis (Attias, 1999; Miskin & MacKay, ; Valpola ) principal components analysis (Bishop, 1999) linear dynamical systems (Ghahramani & Beal, ) mixture of experts (Ueda & Ghahramani, ) discrete graphical models (Beal & Ghahramani, 2) VIBES software for conjugate-exponential graphs (Winn, 3)

18 Mixture of Factor Analysers Goal: Infer number of clusters Infer intrinsic dimensionality of each cluster Under the assumption that each cluster is Gaussian embed demo

19 Mixture of Factor Analysers True data: 6 Gaussian clusters with dimensions: ( ) embedded in 1-D Inferred structure: number of points intrinsic dimensionalities per cluster Finds the clusters and dimensionalities efficiently. The model complexity reduces in line with the lack of data support. demos: run simple and ueda demo

20 Hidden Markov Models S 1 S 2 S 3 S T Y 1 Y 2 Y 3 Y T Discrete hidden states, s t. Observations y t. How many hidden states? What structure state-transition matrix? demo: vbhmm demo

21 Hidden Markov Models: Discriminating Forward from Reverse English First 8 sentences from Alice in Wonderland. Compare VB-HMM with ML-HMM. 1 Discriminating Forward and Reverse English 1 Model Selection using F 11 Classification Accuracy.8.6 F ML VB1 VB Hidden States Hidden States

22 Linear Dynamical Systems X 1 X 2 X 3 X T Y 1 Y 2 Y 3 Y T Assumes y t generated from a sequence of Markov hidden state variables x t If transition and output functions are linear, time-invariant, and noise distributions are Gaussian, this is a linear-gaussian state-space model: x t = Ax t 1 + w t, y t = Cx t + v t Three levels of inference: I Given data, structure and parameters, Kalman smoothing hidden state; II Given data and structure, EM hidden state and parameter point estimates; III Given data only, VEM model structure and distributions over parameters and hidden state.

23 Linear Dynamical System Results Inferring model structure: a) SSM(,3) i.e. FA b) SSM(3,3) c) SSM(3,4) a b c X X Y X X Y X X Y t 1 t t t 1 t t t 1 t t Inferred model complexity reduces with less data: True model: SSM(6,6) 1-dim observation vector. d 4 e 35 f 25 g 1 h 3 i j 1 X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t X t 1 X t Y t demo: bayeslds

24 Independent Components Analysis Blind Source Separation: 5 1 msec speech and music sources linearly mixed to produce 11 signals (microphones) from Attias ()

25 Summary & Conclusions Bayesian learning avoids overfitting and can be used to do model comparison / selection. But we need approximations: Laplace BIC Sampling Variational

26 Other topics I would have liked to cover in the course but did not have time to... Nonparametric Bayesian learning, infinite models and Dirichlet Processes Bayes net structure learning Nonlinear dimensionality reduction More on loopy belief propagation, Bethe free energies, and Expectation Propagation Exact sampling Particle filters Semi-supervised learning Gaussian processes and SVMs (supervised!) Reinforcement Learning

### The Variational Bayesian EM Algorithm for Incomplete Data: with Application to Scoring Graphical Model Structures

BAYESIAN STATISTICS 7, pp. 000 000 J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) c Oxford University Press, 2003 The Variational Bayesian EM

### Bayesian Hidden Markov Models and Extensions

Bayesian Hidden Markov Models and Extensions Zoubin Ghahramani Department of Engineering University of Cambridge joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, Yee Whye Teh Modeling

### Probabilistic Modelling and Bayesian Inference

Probabilistic Modelling and Bayesian Inference Zoubin Ghahramani University of Cambridge, UK Uber AI Labs, San Francisco, USA zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ MLSS Tübingen Lectures

### Probabilistic Modelling and Bayesian Inference

Probabilistic Modelling and Bayesian Inference Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ MLSS Tübingen Lectures

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

### An introduction to Variational calculus in Machine Learning

n introduction to Variational calculus in Machine Learning nders Meng February 2004 1 Introduction The intention of this note is not to give a full understanding of calculus of variations since this area

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### Variational Principal Components

Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

### Density Estimation. Seungjin Choi

Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

### Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

### Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

### Bayesian Machine Learning

Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September

### Variational Message Passing. By John Winn, Christopher M. Bishop Presented by Andy Miller

Variational Message Passing By John Winn, Christopher M. Bishop Presented by Andy Miller Overview Background Variational Inference Conjugate-Exponential Models Variational Message Passing Messages Univariate

### Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Lecture 9: PGM Learning

13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

### VIBES: A Variational Inference Engine for Bayesian Networks

VIBES: A Variational Inference Engine for Bayesian Networks Christopher M. Bishop Microsoft Research Cambridge, CB3 0FB, U.K. research.microsoft.com/ cmbishop David Spiegelhalter MRC Biostatistics Unit

### Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

### Generative and Discriminative Approaches to Graphical Models CMSC Topics in AI

Generative and Discriminative Approaches to Graphical Models CMSC 35900 Topics in AI Lecture 2 Yasemin Altun January 26, 2007 Review of Inference on Graphical Models Elimination algorithm finds single

### Foundations of Statistical Inference

Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes

### Machine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014

Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

### Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### Week 3: The EM algorithm

Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

### Approximate Inference using MCMC

Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

### Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

### Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

### Introduction to Graphical Models

Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

### Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

### Introduction to Machine Learning

Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

### An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models

An Unsupervised Ensemble Learning Method for Nonlinear Dynamic State-Space Models Harri Valpola and Juha Karhunen Helsinki University of Technology, Neural Networks Research Centre P.O.Box 5400, FIN-02015

### Variational Message Passing

Journal of Machine Learning Research 5 (2004)?-? Submitted 2/04; Published?/04 Variational Message Passing John Winn and Christopher M. Bishop Microsoft Research Cambridge Roger Needham Building 7 J. J.

### Overview of Statistical Tools. Statistical Inference. Bayesian Framework. Modeling. Very simple case. Things are usually more complicated

Fall 3 Computer Vision Overview of Statistical Tools Statistical Inference Haibin Ling Observation inference Decision Prior knowledge http://www.dabi.temple.edu/~hbling/teaching/3f_5543/index.html Bayesian

### Outline Lecture 2 2(32)

Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

### Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

### NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

### Non-linear Bayesian Image Modelling

Non-linear Bayesian Image Modelling Christopher M. Bishop 1 and John M. Winn 2 1 Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/

### Probabilistic Graphical Models

Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

### CPSC 540: Machine Learning

CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 13: SEQUENTIAL DATA Contents in latter part Linear Dynamical Systems What is different from HMM? Kalman filter Its strength and limitation Particle Filter

### σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

### Probabilistic Graphical Models

2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

### Variational Bayesian Logistic Regression

Variational Bayesian Logistic Regression Sargur N. University at Buffalo, State University of New York USA Topics in Linear Models for Classification Overview 1. Discriminant Functions 2. Probabilistic

### Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

### Outline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)

Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation

### Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

### Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

### Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University

### Probabilistic Reasoning in Deep Learning

Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian

### Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

### An Ensemble Learning Approach to Nonlinear Dynamic Blind Source Separation Using State-Space Models

An Ensemble Learning Approach to Nonlinear Dynamic Blind Source Separation Using State-Space Models Harri Valpola, Antti Honkela, and Juha Karhunen Neural Networks Research Centre, Helsinki University

### Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

### Introduction to Machine Learning Midterm Exam Solutions

10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

### Basic math for biology

Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

### Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

### Introduction to Gaussian Process

Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

### Building an Automatic Statistician

Building an Automatic Statistician Zoubin Ghahramani Department of Engineering University of Cambridge zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ ALT-Discovery Science Conference, October

### The Laplace Approximation

The Laplace Approximation Sargur N. University at Buffalo, State University of New York USA Topics in Linear Models for Classification Overview 1. Discriminant Functions 2. Probabilistic Generative Models

### output dimension input dimension Gaussian evidence Gaussian Gaussian evidence evidence from t +1 inputs and outputs at time t x t+2 x t-1 x t+1

To appear in M. S. Kearns, S. A. Solla, D. A. Cohn, (eds.) Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 999. Learning Nonlinear Dynamical Systems using an EM Algorithm Zoubin

### Expectation Maximization (EM)

Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

### Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks

Dynamic Probabilistic Models for Latent Feature Propagation in Social Networks Creighton Heaukulani and Zoubin Ghahramani University of Cambridge TU Denmark, June 2013 1 A Network Dynamic network data

### Latent Dirichlet Allocation Introduction/Overview

Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models

### Machine Learning, Midterm Exam

10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:

### Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

### Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

### Lecture 3: Machine learning, classification, and generative models

EE E6820: Speech & Audio Processing & Recognition Lecture 3: Machine learning, classification, and generative models 1 Classification 2 Generative models 3 Gaussian models Michael Mandel

### Statistical learning. Chapter 20, Sections 1 3 1

Statistical learning Chapter 20, Sections 1 3 Chapter 20, Sections 1 3 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Expectation Maximization Mark Schmidt University of British Columbia Winter 2018 Last Time: Learning with MAR Values We discussed learning with missing at random values in data:

### Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

### Latent Variable View of EM. Sargur Srihari

Latent Variable View of EM Sargur srihari@cedar.buffalo.edu 1 Examples of latent variables 1. Mixture Model Joint distribution is p(x,z) We don t have values for z 2. Hidden Markov Model A single time

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

### Machine Learning Lecture 5

Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

### Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

### PILCO: A Model-Based and Data-Efficient Approach to Policy Search

PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol

### Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

### Lecture 6: Bayesian Inference in SDE Models

Lecture 6: Bayesian Inference in SDE Models Bayesian Filtering and Smoothing Point of View Simo Särkkä Aalto University Simo Särkkä (Aalto) Lecture 6: Bayesian Inference in SDEs 1 / 45 Contents 1 SDEs

### Bayesian Inference in Astronomy & Astrophysics A Short Course

Bayesian Inference in Astronomy & Astrophysics A Short Course Tom Loredo Dept. of Astronomy, Cornell University p.1/37 Five Lectures Overview of Bayesian Inference From Gaussians to Periodograms Learning

### Clustering problems, mixture models and Bayesian nonparametrics

Clustering problems, mixture models and Bayesian nonparametrics Nguyễn Xuân Long Department of Statistics Department of Electrical Engineering and Computer Science University of Michigan Vietnam Institute

### Probabilistic Graphical Models

Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially

### Bayesian Mixtures of Bernoulli Distributions

Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions

### Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

### p(d θ ) l(θ ) 1.2 x x x

p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

### Machine Learning 4771

Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood

### Part 4: Conditional Random Fields

Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.

### 6.867 Machine learning, lecture 23 (Jaakkola)

Lecture topics: Markov Random Fields Probabilistic inference Markov Random Fields We will briefly go over undirected graphical models or Markov Random Fields (MRFs) as they will be needed in the context