An Introduction to Bayesian Machine Learning


1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013

2 What is Machine Learning?
The design of computational systems that discover patterns in a collection of data instances in an automated manner. The ultimate goal is to use the discovered patterns to make predictions on new data instances not seen before. Instead of manually encoding patterns in computer programs, we make computers learn these patterns without explicitly programming them.
[Figure: rows of handwritten digits labelled 0, 1 and 2, followed by unlabelled instances to be classified.] Figure source [Hinton et al. 2006].

3 Model-based Machine Learning
We design a probabilistic model which explains how the data is generated. An inference algorithm combines the model and the data to make predictions. Probability theory is used to deal with uncertainty in the model or the data.
[Figure: data and a probabilistic model are combined by an inference algorithm to produce a prediction for new data, e.g. 0: 0.900, 1: 0.005, 2: 0.095.]

4 Basics of Probability Theory
The theory of probability can be derived using just two rules:
Sum rule: p(x) = ∫ p(x, y) dy.
Product rule: p(x, y) = p(y|x) p(x) = p(x|y) p(y).
They can be combined to obtain Bayes' rule:
p(y|x) = p(x|y) p(y) / p(x) = p(x|y) p(y) / ∫ p(x, y) dy.
Independence of X and Y: p(x, y) = p(x) p(y).
Conditional independence of X and Y given Z: p(x, y|z) = p(x|z) p(y|z).
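The same bookkeeping can be written out numerically. Below is a minimal sketch on a discrete joint distribution; the joint table p(x, y) is hypothetical and chosen only to illustrate the sum, product and Bayes rules (sums replace integrals in the discrete case).

```python
import numpy as np

# Hypothetical joint table: p_xy[i, j] = p(x = i, y = j).
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

# Sum rule: marginalize out one variable.
p_x = p_xy.sum(axis=1)                 # p(x)
p_y = p_xy.sum(axis=0)                 # p(y)

# Product rule: p(x, y) = p(y | x) p(x).
p_y_given_x = p_xy / p_x[:, None]      # p(y | x)

# Bayes' rule: p(x | y) = p(y | x) p(x) / p(y).
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

# Sanity check: both factorizations recover the same joint.
assert np.allclose(p_x_given_y * p_y[None, :], p_xy)
print(p_x_given_y)
```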

5 The Bayesian Framework
The probabilistic model M with parameters θ explains how the data D is generated by specifying the likelihood function p(D|θ, M). Our initial uncertainty about θ is encoded in the prior distribution p(θ|M). Bayes' rule allows us to update our uncertainty about θ given D:
p(θ|D, M) = p(D|θ, M) p(θ|M) / p(D|M).
We can then generate probabilistic predictions for some quantity x of new data D_new, given D and M, using
p(x|D_new, D, M) = ∫ p(x|θ, D_new, M) p(θ|D, M) dθ.
In practice these predictions are approximated using an inference algorithm.
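As a concrete special case, the sketch below works through the Bayesian update for a Bernoulli model with a conjugate Beta prior, where both the posterior and the predictive integral are available in closed form. The observed flips and the prior hyperparameters a0, b0 are hypothetical.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed coin flips D (hypothetical)
a0, b0 = 2.0, 2.0                           # prior p(theta | M) = Beta(a0, b0)

# Bayes' rule in closed form: the posterior is again a Beta distribution.
a_post = a0 + data.sum()
b_post = b0 + len(data) - data.sum()

# Predictive p(x = 1 | D, M): the integral over theta reduces to the posterior mean.
p_next_is_one = a_post / (a_post + b_post)
print(f"posterior = Beta({a_post:.0f}, {b_post:.0f}), p(next flip = 1 | D) = {p_next_is_one:.3f}")
```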

6 Bayesian Model Comparison
Given a particular D, we can use the model evidence p(D|M) to reject both overly simple and overly complex models. Figure source [Ghahramani 2012].
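For intuition, here is a minimal sketch comparing two coin-flip models by their evidence: M0 fixes θ = 0.5, while M1 places a Beta(1, 1) prior on θ. Both evidences have closed forms; the counts n, k and the prior are hypothetical.

```python
from math import comb
import numpy as np
from scipy.special import betaln

n, k = 20, 15                    # 20 flips, 15 heads (hypothetical data)

# Evidence under M0: binomial likelihood with theta fixed at 0.5.
log_evidence_m0 = np.log(comb(n, k)) + n * np.log(0.5)

# Evidence under M1: integrate the binomial likelihood against Beta(a, b),
# giving comb(n, k) * B(a + k, b + n - k) / B(a, b).
a, b = 1.0, 1.0
log_evidence_m1 = np.log(comb(n, k)) + betaln(a + k, b + n - k) - betaln(a, b)

print("log p(D | M0) =", round(log_evidence_m0, 3))
print("log p(D | M1) =", round(log_evidence_m1, 3))   # higher evidence for M1 on this data
```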

7 Probabilistic Graphical Models
The Bayesian framework requires us to specify a high-dimensional distribution p(x_1, ..., x_k) over the data, the model parameters and the latent variables. Working with fully flexible joint distributions is intractable! We will instead work with structured distributions, in which each random variable interacts directly with only a few others. These distributions have many conditional independencies. This structure allows us to:
- Obtain a compact representation of the distribution.
- Use computationally efficient inference algorithms.
The framework of probabilistic graphical models allows us to represent and work with such structured distributions in an efficient manner.

8 Some Examples of Probabilistic Graphical Models
Bayesian network (directed) over Season (S), Flu (F), Hayfever (H), Muscle-Pain (M) and Congestion (C):
- Independencies: (F ⊥ H | S), (C ⊥ S | F, H), (M ⊥ H, C | F), (M ⊥ C | F), ...
- Factorization: p(S, F, H, M, C) = p(S) p(F|S) p(H|S) p(C|F, H) p(M|F).
Markov network (undirected) over A, B, C, D arranged in a cycle:
- Independencies: (A ⊥ C | B, D), (B ⊥ D | A, C).
- Factorization: p(A, B, C, D) = (1/Z) φ_1(A, B) φ_2(B, C) φ_3(C, D) φ_4(A, D).
Figure source [Koller et al. 2009].

9 Bayesian Networks
A BN is a DAG G whose nodes are random variables X_1, ..., X_d. Let Pa_G(X_i) be the parents of X_i in G. The network is annotated with the conditional distributions p(X_i | Pa_G(X_i)).
Conditional independencies: let ND_G(X_i) be the variables in G which are non-descendants of X_i. G encodes the conditional independencies (X_i ⊥ ND_G(X_i) | Pa_G(X_i)), i = 1, ..., d.
Factorization: G encodes the factorization p(x_1, ..., x_d) = ∏_{i=1}^d p(x_i | pa_G(x_i)).
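The factorization is easy to exercise in code. The sketch below builds the joint of the Season/Flu/Hayfever network from the previous slide as a product of local conditional distributions; all CPT numbers are hypothetical and chosen only to show the mechanics.

```python
# Variables: S (season, 0 = dry, 1 = wet) and binary F, H, C, M.
p_S = {0: 0.5, 1: 0.5}                                   # p(S)
p_F_given_S = {0: 0.05, 1: 0.20}                         # p(F = 1 | S)
p_H_given_S = {0: 0.20, 1: 0.05}                         # p(H = 1 | S)
p_C_given_FH = {(0, 0): 0.05, (0, 1): 0.80,
                (1, 0): 0.80, (1, 1): 0.90}              # p(C = 1 | F, H)
p_M_given_F = {0: 0.10, 1: 0.70}                         # p(M = 1 | F)

def bern(p_one, value):
    """p(value) for a binary variable with p(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(s, f, h, c, m):
    """p(S, F, H, C, M) as the product of the local conditionals given by the DAG."""
    return (p_S[s]
            * bern(p_F_given_S[s], f)
            * bern(p_H_given_S[s], h)
            * bern(p_C_given_FH[(f, h)], c)
            * bern(p_M_given_F[f], m))

# The factorized joint sums to one over all 2^5 configurations.
total = sum(joint(s, f, h, c, m)
            for s in (0, 1) for f in (0, 1) for h in (0, 1)
            for c in (0, 1) for m in (0, 1))
print(total)
```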

10 BN Examples: Naive Bayes
We have features X = (X_1, ..., X_d) and a label Y ∈ {1, ..., C}. The figure shows the Naive Bayes graphical model with parameters θ_1 and θ_2 for a dataset D = {(x_i, y_i)}_{i=1}^n, using plate notation. Shaded nodes are observed; unshaded nodes are not observed. The joint distribution for D, θ_1 and θ_2 is
p(D, θ_1, θ_2) = [∏_{i=1}^n ∏_{j=1}^d p(x_{i,j} | y_i, θ_1) p(y_i | θ_2)] p(θ_1) p(θ_2).
Figure source [Murphy 2012].
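Below is a minimal sketch of this model for binary features, with Beta(1, 1) pseudo-counts standing in for the priors on θ_1 and θ_2 (used through their posterior means). The toy dataset and the hyperparameters are hypothetical.

```python
import numpy as np

X = np.array([[1, 0, 1],      # n = 4 instances, d = 3 binary features (hypothetical)
              [1, 1, 1],
              [0, 0, 0],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])    # class labels

classes = np.unique(y)
a, b = 1.0, 1.0               # Beta(1, 1) pseudo-counts for each feature

# Posterior-mean estimates of p(x_j = 1 | y = c, theta_1) and p(y = c | theta_2).
theta1 = {c: (X[y == c].sum(axis=0) + a) / ((y == c).sum() + a + b) for c in classes}
theta2 = {c: ((y == c).sum() + 1.0) / (len(y) + len(classes)) for c in classes}

def log_joint_per_class(x):
    """log p(y = c) + sum_j log p(x_j | y = c) for each class c."""
    return {c: np.log(theta2[c])
               + np.sum(x * np.log(theta1[c]) + (1 - x) * np.log(1 - theta1[c]))
            for c in classes}

print(log_joint_per_class(np.array([1, 0, 1])))   # should favour class 1
```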

11 BN Examples: Hidden Markov Model
We have a temporal sequence of measurements D = {x_1, ..., x_T}. The time dependence is explained by a hidden process Z = {z_1, ..., z_T}, which is modeled by a first-order Markov chain. The data are noisy observations of the hidden process. The joint distribution for D, Z, θ_1 and θ_2 is
p(D, Z, θ_1, θ_2) = [∏_{t=1}^T p(x_t | z_t, θ_1)] [∏_{t=1}^T p(z_t | z_{t-1}, θ_2)] p(θ_1) p(θ_2) p(z_0).
Figure source [Murphy 2012].
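Given fixed parameters, the hidden chain can be summed out efficiently with the forward recursion. The sketch below computes p(D | θ) for a two-state HMM with discrete observations; the transition matrix, emission matrix, initial distribution and observation sequence are all hypothetical.

```python
import numpy as np

A = np.array([[0.9, 0.1],        # A[i, j] = p(z_t = j | z_{t-1} = i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],        # B[i, k] = p(x_t = k | z_t = i)
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])        # initial distribution over the first hidden state

obs = [0, 0, 1, 1, 1]            # observed sequence D

# Forward recursion: alpha[i] = p(x_1, ..., x_t, z_t = i).
alpha = pi * B[:, obs[0]]
for x_t in obs[1:]:
    alpha = (alpha @ A) * B[:, x_t]

print("p(D | theta) =", alpha.sum())
```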

12 BN Examples: Matrix Factorization Model
We have observations x_{i,j} from an n × d matrix X. The entries of X are generated as a function of the entries of a low-rank matrix UVᵀ, where U is n × k, V is d × k and k ≪ min(n, d). The joint distribution is
p(D, U, V, θ_1, θ_2, θ_3) = [∏_{i=1}^n ∏_{j=1}^d p(x_{i,j} | u_i, v_j, θ_3)] [∏_{i=1}^n p(u_i | θ_1)] [∏_{j=1}^d p(v_j | θ_2)] p(θ_1) p(θ_2) p(θ_3).
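A minimal sketch of this joint under common assumptions: Gaussian observation noise and zero-mean Gaussian priors on the rows of U and V, with the hyperpriors on θ_1, θ_2, θ_3 omitted. All sizes and scale parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 4, 2
U = rng.normal(size=(n, k))
V = rng.normal(size=(d, k))
X = U @ V.T + 0.1 * rng.normal(size=(n, d))   # fully observed toy matrix

sigma_x, sigma_u, sigma_v = 0.1, 1.0, 1.0     # hypothetical scale parameters

def log_joint(X, U, V):
    """Unnormalized log joint: Gaussian likelihood term plus Gaussian priors on U and V."""
    log_lik = -0.5 * np.sum((X - U @ V.T) ** 2) / sigma_x ** 2
    log_prior_u = -0.5 * np.sum(U ** 2) / sigma_u ** 2
    log_prior_v = -0.5 * np.sum(V ** 2) / sigma_v ** 2
    return log_lik + log_prior_u + log_prior_v

print(log_joint(X, U, V))
```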

13 D-separation
Conditional independence properties can be read directly from the graph. We say that the sets of nodes A, B and C satisfy (A ⊥ B | C) when all possible paths from any node in A to any node in B are blocked. A path is blocked if it contains a node x at which the arrows meet
1 - head-to-tail or tail-to-tail, and x is in C, or
2 - head-to-head, and neither x nor any of its descendants is in C.
In the example graph, (a ⊥ b | c) does not follow from the graph, whereas (a ⊥ b | f) is implied by the graph. Figure source [Bishop 2006].
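These two claims can be checked mechanically. Below is a minimal sketch of a d-separation test using the standard reduction: A and B are d-separated by C iff they are disconnected in the moralized ancestral graph of A ∪ B ∪ C after deleting the nodes in C. The edge list (a→e, f→e, e→c, f→b) is my reading of the example figure and should be treated as an assumption.

```python
from itertools import combinations

edges = [("a", "e"), ("f", "e"), ("e", "c"), ("f", "b")]   # assumed example DAG

def parents(node):
    return {p for p, c in edges if c == node}

def ancestral_set(nodes):
    """All the given nodes plus their ancestors."""
    result, frontier = set(nodes), set(nodes)
    while frontier:
        frontier = {p for n in frontier for p in parents(n)} - result
        result |= frontier
    return result

def d_separated(a, b, given):
    keep = ancestral_set({a, b} | set(given))
    # Moralize: keep the DAG edges inside the ancestral set (undirected) and marry co-parents.
    undirected = {frozenset(e) for e in edges if set(e) <= keep}
    for node in keep:
        for p1, p2 in combinations(sorted(parents(node) & keep), 2):
            undirected.add(frozenset((p1, p2)))
    # Delete the conditioning nodes, then test reachability from a to b.
    reachable, frontier = {a}, {a}
    while frontier:
        frontier = {y for e in undirected for y in e
                    if not (e & set(given)) and e & frontier and y not in reachable}
        reachable |= frontier
    return b not in reachable

print(d_separated("a", "b", {"c"}))   # False: (a ⊥ b | c) does not hold
print(d_separated("a", "b", {"f"}))   # True:  (a ⊥ b | f) holds
```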

14 Bayesian Networks as Filters
We can view the graph G as a filter on the set of all possible distributions p(x_1, ..., x_d):
- p must satisfy the CIs implied by d-separation in G.
- p must factorize as p(x_1, ..., x_d) = ∏_{i=1}^d p(x_i | pa_G(x_i)).
- p must satisfy the CIs (X_i ⊥ ND_G(X_i) | Pa_G(X_i)), i = 1, ..., d.
The three filters are the same: out of all possible distributions, each lets through exactly the distributions that factorize according to G and satisfy its CIs. Figure source [Bishop 2006].

15 Markov Blanket
The Markov blanket of a node x is the set of nodes comprising its parents, its children and its co-parents (the other parents of its children). It is the minimal set of nodes that isolates x from the rest of the graph: x is conditionally independent of any other node in the graph given its Markov blanket. Figure source [Bishop 2006].

16 Markov Networks
A MN is an undirected graph G whose nodes are the random variables X_1, ..., X_d. It is annotated with the potential functions φ_1(D_1), ..., φ_k(D_k), where D_1, ..., D_k are sets of variables, each forming a maximal clique of G, and φ_1, ..., φ_k are positive functions.
Conditional independencies: G encodes the conditional independencies (A ⊥ B | C) for any sets of nodes A, B and C such that C separates A from B in G.
Factorization: G encodes the factorization p(X_1, ..., X_d) = (1/Z) ∏_{i=1}^k φ_i(D_i), where Z is a normalization constant.
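As a small concrete example, the sketch below defines a Markov network over four binary variables connected in a cycle A-B-C-D-A, with a hypothetical pairwise potential that favours equal neighbours, and computes the normalization constant Z by brute-force enumeration.

```python
import itertools
import numpy as np

def phi(x, y):
    """Pairwise potential favouring equal neighbouring values (hypothetical)."""
    return 2.0 if x == y else 1.0

cliques = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
names = ["A", "B", "C", "D"]

def unnormalized(assignment):
    """Product over cliques of phi(D_i) for a full assignment (a dict)."""
    return np.prod([phi(assignment[u], assignment[v]) for u, v in cliques])

# Z sums the unnormalized product over all 2^4 joint configurations.
Z = sum(unnormalized(dict(zip(names, values)))
        for values in itertools.product([0, 1], repeat=4))

p_all_equal = unnormalized({"A": 1, "B": 1, "C": 1, "D": 1}) / Z
print("Z =", Z, " p(A=B=C=D=1) =", round(p_all_equal, 4))
```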

17 Clique and Maximal Clique
Clique: a fully connected subset of nodes. Maximal clique: a clique in which we cannot include any more nodes without it ceasing to be a clique. In the example graph, (x_1, x_2) is a clique but not a maximal clique, while (x_2, x_3, x_4) is a maximal clique, and
p(x_1, ..., x_4) = (1/Z) φ_1(x_1, x_2, x_3) φ_2(x_2, x_3, x_4).
Figure source [Bishop 2006].

18 MN Examples: Potts Model
Let x_1, ..., x_n ∈ {1, ..., C}, with
p(x_1, ..., x_n) = (1/Z) ∏_{(i,j)} φ_{ij}(x_i, x_j),
log φ_{ij}(x_i, x_j) = β if x_i = x_j and 0 otherwise, with β > 0, where the product runs over pairs of neighbouring nodes (i, j).
Figure sources [Bishop 2006] and Erik Sudderth.
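Since the conditional distribution of one node given the rest depends only on its neighbours, the model is easy to sample with Gibbs sampling. The sketch below runs a few Gibbs sweeps on a small grid; the grid size, the number of states C, β and the number of sweeps are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, beta = 16, 16, 3, 1.2
x = rng.integers(C, size=(H, W))          # random initial labelling

def neighbours(i, j):
    return [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= i + di < H and 0 <= j + dj < W]

for sweep in range(50):
    for i in range(H):
        for j in range(W):
            # p(x_ij = c | rest) is proportional to exp(beta * #neighbours equal to c).
            counts = np.array([sum(x[a, b] == c for a, b in neighbours(i, j))
                               for c in range(C)])
            probs = np.exp(beta * counts)
            x[i, j] = rng.choice(C, p=probs / probs.sum())

print(np.bincount(x.ravel(), minlength=C))   # label counts after the sweeps
```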

19 From Directed Graphs to Undirected Graphs: Moralization
Let p(x_1, ..., x_4) = p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3). How do we obtain the corresponding undirected model? The factor p(x_4 | x_1, x_2, x_3) implies that x_1, ..., x_4 must all belong to a single maximal clique. General method: 1 - fully connect all the parents of each node; 2 - eliminate the edge directions. Moralization adds the fewest extra links and so retains the maximum number of CIs. Figure source [Bishop 2006].

20 CIs in Directed and Undirected Models Markov Network in Undirected Models If all the CIs of p(x 1,..., x n ) are reflected in G, and vice versa, then G is said to be a perfect map. A B C A B D U P C D Figure source [Bishop 2006].

21 Basic Distributions: Bernoulli Distribution for x {0, 1} governed by µ [0, 1] such that µ = p(x = 1). Bern(x µ) = xµ + (1 x)(1 µ). E(x) = µ. Var(x) = µ(1 µ). 0 1

Basic Distributions: Beta Distribution for µ [0, 1] such as the prob. of a binary event. Beta(µ a, b) = Γ(a + b) Γ(a)Γ(b) µa 1 (1 µ) b 1. E(x) = a/(a + b). Var(x) = ab/((a + b) 2 (a + b + 1)). pdf 0.0 0.5 1.0 1.5 2.0 2.5 3.0 a = 0.5, b = 0.5 a = 5, b = 1 a = 1, b = 3 a = 2, b = 2 a = 2, b = 5 0.0 0.2 0.4 0.6 0.8 1.0 x 22

23 Basic Distributions: Multinomial We extract with replacement n balls of k different categories from a bag. Let x i and denote the number of balls extracted and p i the probability, both of category i = 1,..., k. p(x 1,..., x k n, p 1,..., p k ) = n! x 1! x k! px 1 1 px k k if k i=1 x k = n 0 otherwise. E(x i ) = np i. Var(x i ) = np i (1 p i ). Cov(x i, x j ) = np i p j (1 p i ).

24 Basic Distributions: Dirichlet Multivariate distribution over µ 1,..., µ k [0, 1], where k i=1 µ i = 1. Parameterized in terms of α = (α 1,..., α k ) with α i > 0 for i = 1,..., k. ( k ) Γ i=1 α k k Dir(µ 1,..., µ k α) = µ α i i. Γ(α 1 ) Γ(α k ) i=1 E(µ i ) = a i k j=1 a. j Figure source [Murphy 2012].

Basic Distributions: Multivariate Gaussian E(x) = µ. { 1 1 p(x µ, Σ) = (2π) n/2 exp 1 } Σ 1/2 2 (x µ)t Σ 1 (x µ). Cov(x) = Σ. Figure source [Murphy 2012]. 25

26 Basic Distributions: Wishart Distribution for the precision matrix Λ = Σ 1 of a Multivariate Gaussian. { W(Λ w, ν) = B(W, ν) Λ (ν D 1) exp 1 } 2 Tr(W 1 Λ), where ( D ( ) ) ν + 1 i B(W, ν) W ν/2 2 vd/2 π D(D 1)/4 Γ. 2 i=1 E(Λ) = νw.

27 Summary With ML computers learn patterns and then use them to make predictions. With ML we avoid to manually encode patterns in computer programs. Model-based ML separates knowledge about the data generation process (model) from reasoning and prediction (inference algorithm). The Bayesian framework allows us to do model-based ML using probability distributions which must be structured for tractability. Probabilistic graphical models encode such structured distributions by specifying several CIs (factorizations) that they must satisfy. Bayesian Networks and Markov Networks are two different types of graphical models which can express different types of CIs.

28 References
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
Bishop, C. M. Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371, 2013.
Ghahramani, Z. Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Transactions of the Royal Society A, 2012.
Hinton, G. E., Osindero, S. and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.
Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Murphy, K. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.