Unsupervised Learning


Unsupervised Learning: Bayesian Model Comparison. Zoubin Ghahramani, zoubin@gatsby.ucl.ac.uk. Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept of Computer Science, University College London. Term 1, Autumn 2005.

Model complexity and overfitting: a simple example. [Figure: least-squares polynomial fits of order M = 0 through M = 7 to the same small data set.]

Learning Model Structure. How many clusters in the data? What is the intrinsic dimensionality of the data? Is this input relevant to predicting that output? What is the order of a dynamical system? How many states in a hidden Markov model (e.g. for the amino-acid sequence SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY)? How many auditory sources in the input?

Using Occam's Razor to Learn Model Structure. Compare model classes m using their posterior probability given the data:
$$P(m \mid y) = \frac{P(y \mid m)\,P(m)}{P(y)}, \qquad P(y \mid m) = \int_{\Theta_m} P(y \mid \theta_m, m)\, P(\theta_m \mid m)\, d\theta_m$$
Interpretation of $P(y \mid m)$: the probability that randomly selected parameter values from the model class would generate data set y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.
[Figure: $P(Y \mid M_i)$ plotted over the space of all possible data sets Y, for models that are too simple, "just right", and too complex.]

Bayesian Model Comparison: Terminology. A model class m is a set of models parameterised by $\theta_m$, e.g. the set of all possible mixtures of m Gaussians. The marginal likelihood of model class m,
$$P(y \mid m) = \int_{\Theta_m} P(y \mid \theta_m, m)\, P(\theta_m \mid m)\, d\theta_m,$$
is also known as the Bayesian evidence for model m. The ratio of two marginal likelihoods, $P(y \mid m)\,/\,P(y \mid m')$, is known as the Bayes factor. The Occam's Razor principle is, roughly speaking, that one should prefer simpler explanations to more complex ones. Bayesian inference formalises and automatically implements the Occam's Razor principle.
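
As a small worked illustration of these quantities (my own toy example, not part of the lecture), the sketch below computes the marginal likelihood of some coin-flip data under a fixed fair-coin model and under a Bernoulli model with a uniform Beta(1,1) prior on the bias, then reports the Bayes factor. The data, priors, and variable names are all assumptions made for the example.

```python
import numpy as np
from scipy.special import betaln

# Toy marginal-likelihood / Bayes-factor computation (illustrative sketch):
# coin-flip data y; model m1 = fair coin (no free parameters);
# model m2 = Bernoulli with a Beta(1,1) prior on the bias theta.
y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])    # hypothetical data
n, k = len(y), int(y.sum())

log_evidence_m1 = n * np.log(0.5)               # ln P(y|m1), theta fixed at 0.5
# P(y|m2) = integral of theta^k (1-theta)^(n-k) over Beta(1,1) = B(k+1, n-k+1)/B(1,1)
log_evidence_m2 = betaln(k + 1, n - k + 1) - betaln(1, 1)

bayes_factor = np.exp(log_evidence_m2 - log_evidence_m1)
print(f"ln P(y|m1) = {log_evidence_m1:.3f}")
print(f"ln P(y|m2) = {log_evidence_m2:.3f}")
print(f"Bayes factor, m2 vs m1 = {bayes_factor:.2f}")
```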

Bayesian Model Comparison: Occam's Razor at Work. [Figure: data fits for polynomial orders M = 0 through M = 7, alongside the model evidence P(Y | M) plotted against M; the evidence favours an intermediate order.] e.g. for the quadratic model (M = 2): $y = a_0 + a_1 x + a_2 x^2 + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \tau)$ and $\theta_2 = [a_0\ a_1\ a_2\ \tau]$. demo: polybayes

Practical Bayesian approaches.
Laplace approximations: make a Gaussian approximation about the maximum a posteriori parameter estimate.
Bayesian Information Criterion (BIC): an asymptotic approximation.
Markov chain Monte Carlo methods (MCMC): guaranteed to converge in the limit, but many samples are required to ensure accuracy, and it is sometimes hard to assess convergence.
Variational approximations.
Note: other deterministic approximations have been developed more recently and can be applied in this context, e.g. Bethe approximations and Expectation Propagation.

Laplace Approximation. Data set y; models m = 1, ..., M; parameter sets $\theta_1, \ldots, \theta_M$. Model comparison: $P(m \mid y) \propto P(m)\,P(y \mid m)$. For large amounts of data (relative to the number of parameters, d) the parameter posterior is approximately Gaussian around the MAP estimate $\hat\theta_m$:
$$P(\theta_m \mid y, m) \approx (2\pi)^{-d/2}\, |A|^{1/2}\, \exp\left\{ -\tfrac{1}{2}\, (\theta_m - \hat\theta_m)^\top A\, (\theta_m - \hat\theta_m) \right\}$$
Since $P(y \mid m) = \dfrac{P(\theta_m, y \mid m)}{P(\theta_m \mid y, m)}$, evaluating this at $\hat\theta_m$ gives the Laplace approximation:
$$\ln P(y \mid m) \approx \ln P(\hat\theta_m \mid m) + \ln P(y \mid \hat\theta_m, m) + \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |A|$$
Here A is the $d \times d$ Hessian of the negative log posterior, $A_{kl} = -\dfrac{\partial^2}{\partial \theta_{mk}\, \partial \theta_{ml}} \ln P(\theta_m \mid y, m)\Big|_{\hat\theta_m}$. The same result can be derived from a second-order Taylor expansion of the log posterior. The Laplace approximation can be used for model comparison.
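
To make the formula concrete, here is a minimal sketch (my own toy example, not from the lecture) that applies the Laplace approximation to a one-parameter Bernoulli model with a Beta(2,2) prior, where the exact marginal likelihood is available for comparison. The data, prior, and finite-difference Hessian are assumptions of the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import betaln
from scipy.stats import beta

# Laplace approximation to ln P(y|m) for a Bernoulli model, Beta(2,2) prior.
y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
n, k = len(y), int(y.sum())
a0, b0 = 2.0, 2.0                               # prior hyperparameters

def neg_log_joint(theta):
    # -[ln P(theta|m) + ln P(y|theta, m)]: negative log joint, which equals the
    # negative log posterior up to an additive constant.
    return -(beta.logpdf(theta, a0, b0) + k * np.log(theta) + (n - k) * np.log(1 - theta))

theta_map = minimize_scalar(neg_log_joint, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Hessian A of the negative log posterior at the MAP (d = 1), by finite differences.
eps = 1e-5
A = (neg_log_joint(theta_map + eps) - 2 * neg_log_joint(theta_map)
     + neg_log_joint(theta_map - eps)) / eps**2

d = 1
log_ev_laplace = -neg_log_joint(theta_map) + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.log(A)
log_ev_exact = betaln(a0 + k, b0 + n - k) - betaln(a0, b0)   # conjugate result
print("Laplace:", log_ev_laplace, " exact:", log_ev_exact)
```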

Bayesian Information Criterion (BIC). BIC can be obtained from the Laplace approximation
$$\ln P(y \mid m) \approx \ln P(\hat\theta_m \mid m) + \ln P(y \mid \hat\theta_m, m) + \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |A|$$
in the large sample limit ($N \to \infty$), where N is the number of data points. A grows as $N A_0$ for some fixed matrix $A_0$, so $\ln |A| \approx \ln |N A_0| = \ln (N^d |A_0|) = d \ln N + \ln |A_0|$. Retaining only the terms that grow with N we get:
$$\ln P(y \mid m) \approx \ln P(y \mid \hat\theta_m, m) - \frac{d}{2} \ln N$$
Properties:
Quick and easy to compute.
It does not depend on the prior.
We can use the ML estimate of θ instead of the MAP estimate.
It is equivalent to the Minimum Description Length (MDL) criterion.
It assumes that in the large sample limit all the parameters are well-determined (i.e. the model is identifiable); otherwise d should be the number of well-determined parameters.
Danger: counting parameters can be deceiving! (c.f. sinusoid, infinite models)
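
The quantity most libraries report as "BIC" is, up to a factor of −2, the approximation above, so minimising it is equivalent to maximising the approximate log evidence. As an illustrative sketch (assuming scikit-learn is available; synthetic data and parameter choices are mine, not the lecture's), the snippet below uses BIC to select the number of components in a Gaussian mixture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# BIC-based model selection for a Gaussian mixture on synthetic data drawn
# from three well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-4.0, 0.0, 4.0)])

bics = []
for m in range(1, 8):
    gmm = GaussianMixture(n_components=m, random_state=0).fit(X)
    bics.append(gmm.bic(X))        # sklearn's BIC = -2 ln L + d ln N (lower is better)

best_m = int(np.argmin(bics)) + 1
print("BIC per m:", np.round(bics, 1))
print("Selected number of components:", best_m)
```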

Sampling Approximations. Let's consider a non-Markov-chain method, importance sampling:
$$\ln P(y \mid m) = \ln \int_{\Theta_m} P(y \mid \theta_m, m)\, P(\theta_m \mid m)\, d\theta_m = \ln \int_{\Theta_m} P(y \mid \theta_m, m)\, \frac{P(\theta_m \mid m)}{Q(\theta_m)}\, Q(\theta_m)\, d\theta_m \approx \ln \frac{1}{K} \sum_k P(y \mid \theta_m^{(k)}, m)\, \frac{P(\theta_m^{(k)} \mid m)}{Q(\theta_m^{(k)})}$$
where $\theta_m^{(k)}$ are i.i.d. draws from $Q(\theta_m)$. This assumes we can sample from and evaluate $Q(\theta_m)$ (including its normalisation!) and that we can compute the likelihood $P(y \mid \theta_m^{(k)}, m)$. Although importance sampling does not work well in high dimensions, it inspires the following approach: create a Markov chain, $Q_k \to Q_{k+1} \to \ldots$, for which $Q_k(\theta)$ can be evaluated including normalisation and $\lim_{k \to \infty} Q_k(\theta) = P(\theta \mid y, m)$.
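
The estimator above is easy to try on the earlier one-parameter toy model, where the exact answer is known. The sketch below (my own example; the Beta proposal and sample size are arbitrary assumptions) estimates $\ln P(y \mid m)$ by importance sampling and compares it with the conjugate closed form.

```python
import numpy as np
from scipy.special import betaln, logsumexp
from scipy.stats import beta

# Importance-sampling estimate of ln P(y|m) for a Bernoulli model with a
# Beta(2,2) prior, using a Beta(5,3) proposal Q(theta).
rng = np.random.default_rng(1)
y = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
n, k = len(y), int(y.sum())
a0, b0 = 2.0, 2.0            # prior P(theta)
aq, bq = 5.0, 3.0            # proposal Q(theta)

K = 10_000
theta = rng.beta(aq, bq, size=K)                           # i.i.d. draws from Q
log_w = (k * np.log(theta) + (n - k) * np.log(1 - theta)   # ln P(y|theta)
         + beta.logpdf(theta, a0, b0)                      # + ln P(theta)
         - beta.logpdf(theta, aq, bq))                     # - ln Q(theta)
log_ev_is = logsumexp(log_w) - np.log(K)                   # ln (1/K) sum_k w_k
log_ev_exact = betaln(a0 + k, b0 + n - k) - betaln(a0, b0)
print("importance sampling:", log_ev_is, " exact:", log_ev_exact)
```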

Variational Bayesian Learning: Lower Bounding the Marginal Likelihood. Let the hidden latent variables be x, the data y, and the parameters θ. Lower bound the marginal likelihood (Bayesian model evidence) using Jensen's inequality:
$$\ln P(y) = \ln \int dx\, d\theta\; P(y, x, \theta) = \ln \int dx\, d\theta\; Q(x, \theta)\, \frac{P(y, x, \theta)}{Q(x, \theta)} \;\geq\; \int dx\, d\theta\; Q(x, \theta) \ln \frac{P(y, x, \theta)}{Q(x, \theta)}.$$
Use a simpler, factorised approximation $Q(x, \theta) \approx Q_x(x)\, Q_\theta(\theta)$:
$$\ln P(y) \;\geq\; \int dx\, d\theta\; Q_x(x)\, Q_\theta(\theta) \ln \frac{P(y, x, \theta)}{Q_x(x)\, Q_\theta(\theta)} \;=\; \mathcal{F}(Q_x(x), Q_\theta(\theta), y).$$
Maximize this lower bound.

Variational Bayesian Learning... Maximizing this lower bound, $\mathcal{F}$, leads to EM-like updates:
$$Q_x^*(x) \propto \exp\left\langle \ln P(x, y \mid \theta) \right\rangle_{Q_\theta(\theta)} \quad \text{(E-like step)}, \qquad Q_\theta^*(\theta) \propto P(\theta)\, \exp\left\langle \ln P(x, y \mid \theta) \right\rangle_{Q_x(x)} \quad \text{(M-like step)}.$$
Maximizing $\mathcal{F}$ is equivalent to minimizing the KL divergence between the approximate posterior, $Q_\theta(\theta)\, Q_x(x)$, and the true posterior, $P(\theta, x \mid y)$:
$$\ln P(y) - \mathcal{F}(Q_x(x), Q_\theta(\theta), y) = \ln P(y) - \int dx\, d\theta\; Q_x(x) Q_\theta(\theta) \ln \frac{P(y, x, \theta)}{Q_x(x) Q_\theta(\theta)} = \int dx\, d\theta\; Q_x(x) Q_\theta(\theta) \ln \frac{Q_x(x) Q_\theta(\theta)}{P(x, \theta \mid y)} = \mathrm{KL}(Q \,\|\, P).$$
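
To see the alternating updates in action, here is a minimal coordinate-ascent sketch for the standard textbook example of a Gaussian with unknown mean and precision under a factorised approximation q(μ)q(τ) (cf. Bishop, PRML §10.1.3). The data, priors, and iteration count are my own choices, not the lecture's.

```python
import numpy as np

# Mean-field VB for y_i ~ N(mu, 1/tau) with prior mu|tau ~ N(mu0, 1/(lam0*tau)),
# tau ~ Gamma(a0, b0), using the factorisation q(mu) q(tau).
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=200)
N, ybar = len(y), y.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0     # prior hyperparameters
E_tau = a0 / b0                            # initialise E[tau]

for _ in range(50):
    # "E-like" update of q(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed
    mu_N = (lam0 * mu0 + N * ybar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # "M-like" update of q(tau) = Gamma(a_N, b_N), holding q(mu) fixed
    a_N = a0 + 0.5 * (N + 1)
    E_sq = np.sum((y - mu_N) ** 2) + N / lam_N      # E_q(mu)[ sum_i (y_i - mu)^2 ]
    b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    E_tau = a_N / b_N

print("E[mu] =", mu_N, " E[tau] =", E_tau, " (true mean 2.0, true precision", 1 / 1.5**2, ")")
```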

Conjugate-Exponential models. Let's focus on conjugate-exponential (CE) models, which satisfy conditions (1) and (2).
Condition (1). The joint probability over variables is in the exponential family:
$$P(x, y \mid \theta) = f(x, y)\, g(\theta)\, \exp\left\{ \phi(\theta)^\top u(x, y) \right\}$$
where $\phi(\theta)$ is the vector of natural parameters and u are the sufficient statistics.
Condition (2). The prior over parameters is conjugate to this joint probability:
$$P(\theta \mid \eta, \nu) = h(\eta, \nu)\, g(\theta)^{\eta}\, \exp\left\{ \phi(\theta)^\top \nu \right\}$$
where η and ν are hyperparameters of the prior. Conjugate priors are computationally convenient and have an intuitive interpretation: η is the number of pseudo-observations; ν their values (summed sufficient statistics).
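
As a concrete instance (a standard example, not taken from the slides), a single Bernoulli observation and its conjugate prior can be written in exactly this form:

```latex
% Bernoulli/Beta instance of the conjugate-exponential form (illustration):
P(y \mid \theta) = \theta^{y} (1-\theta)^{1-y}
  = \underbrace{1}_{f(y)} \cdot \underbrace{(1-\theta)}_{g(\theta)}
    \exp\Big\{ \underbrace{\ln\tfrac{\theta}{1-\theta}}_{\phi(\theta)} \cdot \underbrace{y}_{u(y)} \Big\},
\qquad
P(\theta \mid \eta, \nu) \propto g(\theta)^{\eta} \exp\{\phi(\theta)\,\nu\}
  = \theta^{\nu} (1-\theta)^{\eta-\nu}
  \;\propto\; \mathrm{Beta}(\theta;\, \nu+1,\; \eta-\nu+1).
```

So here η plays the role of a pseudo-count of observations and ν of the summed sufficient statistics (pseudo-count of successes), matching the interpretation above.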

Conjugate-Exponential examples.
In the CE family: Gaussian mixtures; factor analysis and probabilistic PCA; hidden Markov models and factorial HMMs; linear dynamical systems and switching models; discrete-variable belief networks. Other as-yet-undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and other distributions.
Not in the CE family: Boltzmann machines and MRFs (no simple conjugacy); logistic regression (no simple conjugacy); sigmoid belief networks (not exponential); independent components analysis (not exponential).
Note: one can often approximate these models with models in the CE family.

A Useful Result. Given an i.i.d. data set $y = (y_1, \ldots, y_n)$, if the model is CE then:
(a) $Q_\theta(\theta)$ is also conjugate, i.e.
$$Q_\theta(\theta) = h(\tilde\eta, \tilde\nu)\, g(\theta)^{\tilde\eta}\, \exp\left\{ \phi(\theta)^\top \tilde\nu \right\}$$
where $\tilde\eta = \eta + n$ and $\tilde\nu = \nu + \sum_i \overline{u}(x_i, y_i)$, with $\overline{u}(x_i, y_i) = \langle u(x_i, y_i) \rangle_{Q_{x_i}(x_i)}$.
(b) $Q_x(x) = \prod_{i=1}^{n} Q_{x_i}(x_i)$ is of the same form as in the E step of regular EM, but using pseudo-parameters computed by averaging over $Q_\theta(\theta)$:
$$Q_{x_i}(x_i) \propto f(x_i, y_i)\, \exp\left\{ \overline{\phi}(\theta)^\top u(x_i, y_i) \right\} = P(x_i \mid y_i, \overline{\phi}(\theta)), \qquad \overline{\phi}(\theta) = \langle \phi(\theta) \rangle_{Q_\theta(\theta)}.$$
KEY points: (a) the approximate parameter posterior is of the same form as the prior, so it is easily summarised in terms of two sets of hyperparameters, $\tilde\eta$ and $\tilde\nu$; (b) the approximate hidden variable posterior, averaging over all parameters, is of the same form as the hidden variable posterior for a single setting of the parameters, so again it is easily computed using the usual methods.

The Variational Bayesian EM algorithm.
EM for MAP estimation. Goal: maximize $p(\theta \mid y, m)$ w.r.t. θ.
E step: compute $q_x^{(t+1)}(x) = p(x \mid y, \theta^{(t)})$.
M step: $\theta^{(t+1)} = \arg\max_\theta \int q_x^{(t+1)}(x) \ln p(x, y, \theta)\, dx$.
Variational Bayesian EM. Goal: lower bound $p(y \mid m)$.
VB-E step: compute $q_x^{(t+1)}(x) = p(x \mid y, \overline{\phi}^{(t)})$.
VB-M step: $q_\theta^{(t+1)}(\theta) \propto \exp\left[ \int q_x^{(t+1)}(x) \ln p(x, y, \theta)\, dx \right]$.
Properties:
Reduces to the EM algorithm if $q_\theta(\theta) = \delta(\theta - \theta^*)$.
$\mathcal{F}_m$ increases monotonically, and incorporates the model complexity penalty.
Analytical parameter distributions (but not constrained to be Gaussian).
The VB-E step has the same complexity as the corresponding E step.
We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VB-E step of VB-EM, but using expected natural parameters $\overline{\phi}$.
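
For a feel of what VB-EM does in practice, the sketch below runs an off-the-shelf variational-Bayes treatment of a Gaussian mixture (scikit-learn's BayesianGaussianMixture, an implementation choice of mine, not the lecture's code) initialised with deliberately too many components; surplus components end up with negligible weight, and the reported lower bound plays the role of F.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# VB Gaussian mixture started with more components than the data support:
# the variational posterior over parameters prunes the surplus components.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (-4.0, 0.0, 4.0)])

vb_gmm = BayesianGaussianMixture(
    n_components=10,                 # deliberately too many
    weight_concentration_prior=0.1,  # small concentration encourages pruning
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(vb_gmm.weights_, 3))                    # most weights collapse towards 0
print("effective components:", np.sum(vb_gmm.weights_ > 0.01))
print("variational lower bound:", vb_gmm.lower_bound_)  # the F of the slides, up to details
```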

Variational Bayes: History of Models Treated. Multilayer perceptrons (Hinton & van Camp, 1993); mixture of experts (Waterhouse, MacKay & Robinson, 1996); hidden Markov models (MacKay, 1995); other work by Jaakkola, Jordan, Barber, Bishop, Tipping, etc.
Examples of Variational Learning of Model Structure: mixtures of factor analysers (Ghahramani & Beal, 1999); mixtures of Gaussians (Attias, 1999); independent components analysis (Attias, 1999; Miskin & MacKay, 2000; Valpola, 2000); principal components analysis (Bishop, 1999); linear dynamical systems (Ghahramani & Beal, 2000); mixture of experts (Ueda & Ghahramani, 2000); discrete graphical models (Beal & Ghahramani, 2002); VIBES software for conjugate-exponential graphs (Winn, 2003).

Mixture of Factor Analysers. Goal: infer the number of clusters and the intrinsic dimensionality of each cluster, under the assumption that each cluster is Gaussian. demo

Mixture of Factor Analysers. True data: 6 Gaussian clusters with dimensions (1 7 4 3 2 2) embedded in 10-D. Inferred structure:
number of points    intrinsic dimensionalities per cluster
8                   2 1
8                   1 2
16                  1 4 2
32                  1 6 3 3 2 2
64                  1 7 4 3 2 2
128                 1 7 4 3 2 2
Finds the clusters and dimensionalities efficiently. The model complexity reduces in line with the lack of data support. demos: run simple and ueda demo

Hidden Markov Models. [Graphical model: a chain of hidden states S_1 → S_2 → S_3 → ... → S_T, each emitting an observation Y_1, ..., Y_T.] Discrete hidden states $s_t$; observations $y_t$. How many hidden states? What structure should the state-transition matrix have? demo: vbhmm demo

Hidden Markov Models: Discriminating Forward from Reverse English. First 8 sentences from Alice in Wonderland; compare VB-HMM with ML-HMM. [Figure: two panels. Left, "Discriminating Forward and Reverse English": classification accuracy versus number of hidden states for ML, VB1 and VB2. Right, "Model Selection using F": the lower bound F versus number of hidden states.]

Linear Dynamical Systems. [Graphical model: a chain of hidden state variables X_1 → X_2 → X_3 → ... → X_T, each emitting an observation Y_1, ..., Y_T.] Assumes $y_t$ is generated from a sequence of Markov hidden state variables $x_t$. If the transition and output functions are linear and time-invariant and the noise distributions are Gaussian, this is a linear-Gaussian state-space model:
$$x_t = A x_{t-1} + w_t, \qquad y_t = C x_t + v_t$$
Three levels of inference:
I. Given data, structure and parameters: Kalman smoothing → hidden state.
II. Given data and structure: EM → hidden state and parameter point estimates.
III. Given data only: variational Bayesian EM → model structure and distributions over parameters and hidden state.
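
As a pointer to what level I inference involves, here is a minimal Kalman filter sketch for this model (my own illustration with made-up parameters A, C, Q, R, not the lecture's demo code); Kalman smoothing adds a backward pass on top of these filtered estimates.

```python
import numpy as np

# Kalman filter for x_t = A x_{t-1} + w_t,  y_t = C x_t + v_t,
# with w_t ~ N(0, Q), v_t ~ N(0, R) and known parameters (level I inference).
def kalman_filter(Y, A, C, Q, R, mu0, V0):
    """Return filtered means and covariances of p(x_t | y_1..t)."""
    d = A.shape[0]
    mu, V = mu0, V0
    means, covs = [], []
    for y in Y:
        # Predict p(x_t | y_1..t-1)
        mu_pred = A @ mu
        V_pred = A @ V @ A.T + Q
        # Update with observation y_t
        S = C @ V_pred @ C.T + R                 # innovation covariance
        K = V_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu_pred + K @ (y - C @ mu_pred)
        V = (np.eye(d) - K @ C) @ V_pred
        means.append(mu)
        covs.append(V)
    return np.array(means), np.array(covs)

# Tiny usage example: a 1-D random walk observed with noise.
rng = np.random.default_rng(0)
A = np.array([[1.0]]); C = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[1.0]])
x = np.cumsum(rng.normal(0, np.sqrt(0.1), 50))
Y = (x + rng.normal(0, 1.0, 50)).reshape(-1, 1)
means, covs = kalman_filter(Y, A, C, Q, R, mu0=np.zeros(1), V0=np.eye(1))
```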

Linear Dynamical System Results. Inferring model structure: (a) SSM(0,3), i.e. factor analysis; (b) SSM(3,3); (c) SSM(3,4). [Figure: the graphical structures X_{t-1} → X_t → Y_t inferred for each model.] Inferred model complexity reduces with less data. True model: SSM(6,6), 10-dim observation vector. [Figure: panels (d)-(j) show the structures inferred as the number of data points decreases.] demo: bayeslds

Independent Components Analysis. Blind source separation: 5 speech and music sources linearly mixed to produce 11 signals (microphones). From Attias (2000).

Summary & Conclusions. Bayesian learning avoids overfitting and can be used for model comparison / selection. But we need approximations: Laplace, BIC, sampling, and variational methods.

Other topics I would have liked to cover in the course but did not have time to:
Nonparametric Bayesian learning, infinite models and Dirichlet processes
Bayes net structure learning
Nonlinear dimensionality reduction
More on loopy belief propagation, Bethe free energies, and Expectation Propagation
Exact sampling
Particle filters
Semi-supervised learning
Gaussian processes and SVMs (supervised!)
Reinforcement learning