Bayesian Machine Learning


Bayesian Machine Learning
Andrew Gordon Wilson
ORIE 6741 Lecture 4: Occam's Razor, Model Construction, and Directed Graphical Models
https://people.orie.cornell.edu/andrew/orie6741
Cornell University
September 1, 2016
1 / 46

References Bishop (2006), MacKay (2003), Rasmussen and Ghahramani (2001), Ghahramani (2015), Ghahramani (2014), Wilson (2014). 2 / 46

Bayesian Modelling (Theory of Everything) 3 / 46

Regularisation = MAP ≠ Bayesian Inference

Example: Density Estimation
Observations y_1, ..., y_N drawn from an unknown density p(y).
Model: p(y | θ) = w_1 N(y | μ_1, σ_1^2) + w_2 N(y | μ_2, σ_2^2), with θ = {w_1, w_2, μ_1, μ_2, σ_1, σ_2}.
Likelihood: p(y | θ) = ∏_{i=1}^N p(y_i | θ).
Can learn all free parameters θ using maximum likelihood...
4 / 46
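The maximum likelihood fit can be written in a few lines. Below is a minimal sketch (not from the slides; the synthetic data, the unconstrained parameterisation, and the optimiser are illustrative assumptions) of the two-component mixture log-likelihood and a direct numerical fit:

```python
# Minimal sketch (not from the slides): maximum likelihood for the mixture
# p(y|theta) = w1 N(y|mu1, s1^2) + w2 N(y|mu2, s2^2), with w2 = 1 - w1.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_log_lik(theta, y):
    logit_w1, mu1, mu2, log_s1, log_s2 = theta
    w1 = 1.0 / (1.0 + np.exp(-logit_w1))        # keep w1 in (0, 1)
    s1, s2 = np.exp(log_s1), np.exp(log_s2)     # keep standard deviations positive
    lik = w1 * norm.pdf(y, mu1, s1) + (1 - w1) * norm.pdf(y, mu2, s2)
    return -np.sum(np.log(lik))

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(1.0, 1.0, 350)])
fit = minimize(neg_log_lik, x0=np.array([0.0, -1.0, 0.5, 0.0, 0.0]),
               args=(y,), method="Nelder-Mead")
print(fit.x)
# Caution: the likelihood itself is unbounded (a component can collapse onto a
# single data point as its variance shrinks), which is what the next slide addresses.
```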

Regularisation = MAP ≠ Bayesian Inference

Regularisation or MAP
Find argmax_θ log p(θ | y) = log p(y | θ) [model fit] + log p(θ) [complexity penalty] + const.
Choose p(θ) such that p(θ) → 0 faster than p(y | θ) → ∞ as σ_1 → 0 or σ_2 → 0.

Bayesian Inference
Predictive distribution: p(y* | y) = ∫ p(y* | θ) p(θ | y) dθ.
Parameter posterior: p(θ | y) ∝ p(y | θ) p(θ).
p(θ) need not be zero anywhere in order to make reasonable inferences.
Can use a sampling scheme, with conjugate posterior updates for each separate mixture component, using an inverse Gamma prior on the variances σ_1^2, σ_2^2.
5 / 46
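To see why the condition on p(θ) matters, here is a short derivation (not on the slide) of the unbounded likelihood that plain maximum likelihood runs into: place μ_1 on a data point and shrink σ_1.

```latex
% One likelihood factor blows up while the remaining factors stay bounded away
% from zero, so a prior with p(theta) -> 0 fast enough is needed to keep the
% MAP objective (or the posterior) well behaved.
\begin{equation}
\text{set } \mu_1 = y_1:\qquad
p(\mathbf{y} \mid \theta)
\;\ge\;
\underbrace{\frac{w_1}{\sqrt{2\pi}\,\sigma_1}}_{\to\,\infty \text{ as } \sigma_1 \to 0}
\;\prod_{i=2}^{N} w_2\,\mathcal{N}(y_i \mid \mu_2, \sigma_2^2).
\end{equation}
```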

Model Selection and Marginal Likelihood

p(y | M_1, X) = ∫ p(y | f_1(x, w)) p(w) dw.  (1)

[Figure: marginal likelihood p(y | M) over all possible datasets y, for a complex model, a simple model, and an appropriate model.]
6 / 46

Model Comparison

p(H_1 | D) / p(H_2 | D) = [p(D | H_1) / p(D | H_2)] × [p(H_1) / p(H_2)].  (2)
7 / 46
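Spelled out (a restatement in the spirit of the Kass and Raftery (1995) reading cited later): the posterior odds factor into a Bayes factor times the prior odds, and each marginal likelihood averages the fit over that model's prior on its parameters.

```latex
% With equal prior probabilities p(H_1) = p(H_2), the Bayes factor alone
% decides between the models.
\begin{equation}
\frac{p(\mathcal{H}_1 \mid \mathcal{D})}{p(\mathcal{H}_2 \mid \mathcal{D})}
= \underbrace{\frac{p(\mathcal{D} \mid \mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_2)}}_{\text{Bayes factor}}
\times \frac{p(\mathcal{H}_1)}{p(\mathcal{H}_2)},
\qquad
p(\mathcal{D} \mid \mathcal{H}_i) = \int p(\mathcal{D} \mid \theta_i, \mathcal{H}_i)\, p(\theta_i \mid \mathcal{H}_i)\, \mathrm{d}\theta_i .
\end{equation}
```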

Blackboard: Examples of Occam's Razor in Everyday Inferences
For further reading, see the MacKay (2003) textbook, Information Theory, Inference, and Learning Algorithms.
8 / 46

Occam's Razor Example

-1, 3, 7, 11, ??, ??
H_1: the sequence is an arithmetic progression, "add n", where n is an integer.
H_2: the sequence is generated by a cubic function of the form cx^3 + dx^2 + e, where c, d, and e are fractions.
(Here: −(1/11)x^3 + (9/11)x^2 + 23/11.)
9 / 46
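As a quick check (not on the slide; the same example appears in MacKay (2003), Ch. 28), the cubic does map each term of the sequence to the next:

```latex
% f(x) = -x^3/11 + 9x^2/11 + 23/11 applied as a recurrence on the sequence:
\begin{align}
f(-1) &= \tfrac{1}{11} + \tfrac{9}{11} + \tfrac{23}{11} = \tfrac{33}{11} = 3, \\
f(3)  &= -\tfrac{27}{11} + \tfrac{81}{11} + \tfrac{23}{11} = \tfrac{77}{11} = 7, \\
f(7)  &= -\tfrac{343}{11} + \tfrac{441}{11} + \tfrac{23}{11} = \tfrac{121}{11} = 11,
\end{align}
% so both hypotheses explain the data; Occam's razor, expressed through the
% marginal likelihood, favours the simpler arithmetic progression.
```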

Model Selection

[Figure: observed data, outputs y(x) against inputs x.]

Observations y(x). Assume p(y(x) | f(x)) = N(y(x); f(x), σ^2).
Consider polynomials of different orders. As always, observations are out of the chosen model class!
Which model should we choose?

f_0(x) = a_0,  (3)
f_1(x) = a_0 + a_1 x,  (4)
f_2(x) = a_0 + a_1 x + a_2 x^2,  (5)
...  (6)
f_J(x) = a_0 + a_1 x + a_2 x^2 + ... + a_J x^J.  (7)
10 / 46
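Because each f_J is linear in its coefficients, the evidence p(y | J) is a Gaussian integral with a closed form. A minimal sketch (illustrative data and hyperparameters assumed; this is not the code behind the following figures):

```python
# y = Phi a + noise with a ~ N(0, alpha^2 I) makes y jointly Gaussian, so the
# log marginal likelihood is just a multivariate normal log-density.
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(x, y, J, alpha=1.0, sigma=0.1):
    Phi = np.vander(x, N=J + 1, increasing=True)             # columns 1, x, ..., x^J
    cov = alpha**2 * Phi @ Phi.T + sigma**2 * np.eye(len(x))
    return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 0.5 - x + 0.8 * x**2 + 0.1 * rng.standard_normal(len(x))  # noisy quadratic

for J in range(9):
    print(J, round(log_evidence(x, y, J), 2))
# With an isotropic prior the evidence typically peaks at an intermediate
# order ("Occam's hill") rather than growing with J.
```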

Model Selection: Occam's Hill

[Figure: marginal likelihood (evidence) vs. model order.]
Marginal likelihood (evidence) as a function of model order, using an isotropic prior p(a) = N(0, σ^2 I).
11 / 46

Model Selection: Occam's Asymptote

[Figure: marginal likelihood (evidence) vs. model order.]
Marginal likelihood (evidence) as a function of model order, using an anisotropic prior p(a_i) = N(0, γ_i), with γ learned from the data.
12 / 46

Occam's Razor

[Figure: marginal likelihood (evidence) vs. model order. (a) Isotropic Gaussian prior; (b) anisotropic Gaussian prior.]
For further reading, see Rasmussen and Ghahramani (2001) (Occam's Razor), Kass and Raftery (1995) (Bayes Factors), and MacKay (2003), Chapter 28.
13 / 46

Automatic Choice of Dimensionality for PCA

PCA projects a d-dimensional vector x into a k ≤ d dimensional space in a way that maximizes the variance of the projection.
How do we choose k?
14 / 46

Probabilistic PCA

Formulate dimensionality reduction as a probabilistic model:
x = Σ_{j=1}^{k} h_j w_j + m + ε  (8)
  = H w + m + ε,  (9)
ε ~ N(0, V).  (10)
Let V = v I_d and p(w) = N(0, I_k).
The maximum likelihood solution for H, given data D = {x_1, ..., x_N}, is exactly equal to the PCA solution!
Let's place probability distributions over H, m, integrate them away from the likelihood, then use the evidence p(D | k) to determine the value of k.
As N → ∞, the evidence will collapse onto the true value of k.
Automatically Learning the Dimensionality of PCA (Minka, 2001).
15 / 46
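For a hands-on version, a minimal sketch (assumes scikit-learn is available; the synthetic data and noise level are illustrative) of recovering k with Minka's evidence-based estimator as implemented in scikit-learn:

```python
# Data lie near a k_true = 3 dimensional subspace of d = 10 dimensions plus
# isotropic noise; PCA(n_components='mle') applies Minka's (2001) estimate of
# the latent dimensionality.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, d, k_true = 500, 10, 3
H = rng.standard_normal((d, k_true))               # mixing matrix
W = rng.standard_normal((N, k_true))               # latent coordinates
X = W @ H.T + 0.1 * rng.standard_normal((N, d))    # x = Hw + noise

pca = PCA(n_components='mle').fit(X)
print(pca.n_components_)   # expected to be close to k_true
```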

Automatically Learning the Dimensionality of PCA 16 / 46

Automatically Learning the Dimensionality of PCA 17 / 46

Automatically Learning the Dimensionality of PCA 18 / 46

Automatically Learning the Dimensionality of PCA 19 / 46

Automatically Learning the Dimensionality of PCA 20 / 46

Model Construction: Support and Inductive Biases

Support: which datasets (hypotheses) are a priori possible.
Inductive biases: which datasets are a priori likely.
We want to make the support of our model as big as possible, with inductive biases calibrated to particular applications, so as not to rule out potential explanations of the data while still learning quickly from a finite amount of information on a particular application.
Examples (discussion and illustrations with respect to the figure on slide 6): human learning and deep learning.
21 / 46

Graphical Models

Open circles correspond to random variables.
Filled circles correspond to observed random variables (whiteboard).
Small closed circles correspond to deterministic variables (whiteboard).
Square boxes show factor decompositions.
Edges represent statistical dependencies between variables.
The whole model represents a joint probability distribution.
22 / 46

Graphical Models (Motivation)

Graphs are an intuitive way of representing and visualising the relationships between many variables. (Examples: family trees, electric circuit diagrams, neural networks.)
A graph allows us to abstract out the conditional independencies between variables from the details of their parametric forms. (Whiteboard.)
We can answer questions like "Is A dependent on B given that we know the value of C?" just by looking at the graph.
Graphical models allow us to define general message passing algorithms that implement probabilistic inference efficiently. Thus we can answer queries like "What is p(A | C = c)?" without enumerating all settings of all variables in the model.
23 / 46

Independencies 24 / 46

Examples of Conditional Independencies 25 / 46

Group Discussion: Conditional Independence 26 / 46

Directed Graphical Model

Model represents a joint distribution.
Edges show dependencies.
Example (fully connected graph): p(a, b, c) = p(a | b, c) p(b | c) p(c).
Is this a unique representation of p(a, b, c)?
27 / 46

Directed Graphical Model

Model represents a joint distribution.
Edges show dependencies.
Example (fully connected graph): p(a, b, c) = p(a | b, c) p(b | c) p(c).
Is this a unique representation of p(a, b, c)? No: any ordering of the variables in the chain rule gives a valid factorisation. For a fully connected graph:
p(x_1, ..., x_K) = p(x_K | x_1, ..., x_{K-1}) ... p(x_2 | x_1) p(x_1).  (11)
28 / 46

Sparse Directed Graphical Model

Group discussion: what's the joint distribution?
29 / 46

Joint distributions

For a graph with K nodes, the joint distribution is given by
p(x) = ∏_{k=1}^{K} p(x_k | pa_k),  (12)
where pa_k denotes the parents of x_k.
30 / 46
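As a concrete (hypothetical) instance of eq. (12), the sketch below evaluates the joint of a three-node binary network a → c ← b from its local conditional probability tables; all numbers are made up for illustration:

```python
# Joint p(x) = prod_k p(x_k | pa_k) for binary variables a -> c <- b.
parents = {"a": [], "b": [], "c": ["a", "b"]}

# cpt[var][parent_values] = P(var = 1 | parents)
cpt = {
    "a": {(): 0.3},
    "b": {(): 0.6},
    "c": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.95},
}

def joint(assignment):
    """P(assignment) as a product of local conditionals, eq. (12)."""
    p = 1.0
    for var, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p_one = cpt[var][pa_vals]
        p *= p_one if assignment[var] == 1 else 1.0 - p_one
    return p

print(joint({"a": 1, "b": 0, "c": 1}))   # 0.3 * 0.4 * 0.8 = 0.096
```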

Example: Polynomial Regression

y = w^T φ(x, v) + ε,  (13)
ε ~ N(0, σ^2),  (14)
w ~ N(0, α^2).  (15)

What's the graphical model defining the joint distribution p(w, y), with y = (y_1, ..., y_N)^T?
How do we use this graphical model to infer p(y | D, α^2, σ^2, v)?
Group discussion.
31 / 46
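Since the model is linear-Gaussian given the basis, the inference the slide asks about has a closed form. A minimal sketch (a polynomial basis and fixed hyperparameters are assumed here purely for illustration):

```python
# Exact posterior over w and predictive distribution for
# y_i = w^T phi(x_i) + eps,  w ~ N(0, alpha^2 I),  eps ~ N(0, sigma^2).
import numpy as np

def posterior_w(x, y, J, alpha=1.0, sigma=0.1):
    Phi = np.vander(x, N=J + 1, increasing=True)              # N x (J+1) design matrix
    A = Phi.T @ Phi / sigma**2 + np.eye(J + 1) / alpha**2     # posterior precision
    S = np.linalg.inv(A)                                      # posterior covariance
    m = S @ Phi.T @ y / sigma**2                              # posterior mean
    return m, S

def predictive(x_star, m, S, sigma=0.1):
    phi = np.vander(np.atleast_1d(x_star), N=len(m), increasing=True)
    mean = phi @ m
    var = np.sum((phi @ S) * phi, axis=1) + sigma**2          # phi^T S phi + noise
    return mean, var

# Usage: m, S = posterior_w(x_train, y_train, J=3); predictive(0.5, m, S)
```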

Conditional Independencies 32 / 46

Conditional Independencies: Tail-Tail

p(a, b) = Σ_c p(a, b, c) = Σ_c p(a | c) p(b | c) p(c) ≠ p(a) p(b) in general.  (16)
a ⊥̸ b.  (17)
a and b are not marginally independent.
33 / 46

Tail-Tail Observed

Want to see whether p(a, b | c) = p(a | c) p(b | c).
p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c) p(c) / p(c) = p(a | c) p(b | c).  (18)
Therefore a ⊥ b | c.
34 / 46

Tail-Head

p(a, b) = Σ_c p(a, b, c) = Σ_c p(a) p(c | a) p(b | c) = Σ_c p(a, c) p(b | c)  (19)
        = p(a) p(b | a) ≠ p(a) p(b) in general.  (20)
a ⊥̸ b. a and b are not marginally independent.
35 / 46

Tail-Head Observed

Want to see whether p(a, b | c) = p(a | c) p(b | c).
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c)  (21)
            = [p(a) p(c | a) / p(c)] p(b | c) = p(a | c) p(b | c).  (22)
Therefore a ⊥ b | c.
36 / 46

Head-Head

p(a, b) = Σ_c p(a, b, c) = Σ_c p(a) p(b) p(c | a, b) = p(a) p(b).  (23)
a is marginally independent of b.
37 / 46

Head-Head Observed

p(a, b | c) = p(a, b, c) / p(c) = p(a) p(b) p(c | a, b) / p(c) ≠ p(a | c) p(b | c) in general.  (24)
a ⊥̸ b | c.
In all other cases observing c blocked dependencies. However, here, observing c creates dependencies!
This phenomenon is called explaining away (think back to the sprinkler, rain, ground example).
38 / 46
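A small numerical illustration of explaining away (hypothetical numbers, reusing the a → c ← b structure from the earlier sketch): marginally a and b are independent, but conditioning on c couples them.

```python
# If P(a=1 | c=1) and P(a=1 | c=1, b=1) differ, then a and b are dependent
# given c, even though p(a, b) = p(a) p(b) marginally.
import itertools

p_a, p_b = 0.3, 0.6                                           # P(a=1), P(b=1)
p_c = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.95}   # P(c=1 | a, b)

def joint(a, b, c):
    pa = p_a if a else 1 - p_a
    pb = p_b if b else 1 - p_b
    pc = p_c[(a, b)] if c else 1 - p_c[(a, b)]
    return pa * pb * pc

num = sum(joint(1, b, 1) for b in (0, 1))
den = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
print("P(a=1 | c=1)      =", num / den)        # approx 0.45

num_b = joint(1, 1, 1)
den_b = sum(joint(a, 1, 1) for a in (0, 1))
print("P(a=1 | c=1, b=1) =", num_b / den_b)    # approx 0.37: b=1 explains c away
```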

D-separation

Semantics: X ⊥ Y | V if V d-separates X from Y.
Definition: V d-separates X from Y if every undirected path from X to Y is blocked by V. A path is blocked by V if there is a node W on the path such that either:
1. W has converging arrows along the path (→ W ←) (head-head) and neither W nor its descendants are observed (W ∉ V), or
2. W does not have converging arrows along the path (→ W → or ← W →) (head-tail or tail-tail) and W is observed (W ∈ V).
Corollary: Markov blanket of node x_i: {parents ∪ children ∪ parents of children}. x_i is independent of everything else conditioned on this blanket.
39 / 46

D-separation Examples

Is a ⊥ b | c? Is a ⊥ b | f?
How do deterministic parameters (denoted by small black circles), such as the noise variance σ^2 in our Bayesian basis regression model, behave with respect to d-separation?
40 / 46

Data sampled from a Gaussian distribution

If we condition on the mean μ, the data x_i are independent.
But what if we look at the marginal distribution, having integrated away μ?
41 / 46
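One way to see the answer (a short derivation not on the slide, assuming a Gaussian prior μ ~ N(μ_0, τ^2) and x_i | μ ~ N(μ, σ^2) independently): marginalising out μ couples the observations.

```latex
% Law of total covariance: the shared, unobserved mean induces correlation.
\begin{align}
\operatorname{Cov}(x_i, x_j)
  &= \mathbb{E}\big[\operatorname{Cov}(x_i, x_j \mid \mu)\big]
   + \operatorname{Cov}\big(\mathbb{E}[x_i \mid \mu], \mathbb{E}[x_j \mid \mu]\big) \\
  &= 0 + \operatorname{Cov}(\mu, \mu) = \tau^2 \neq 0 \qquad (i \neq j),
\end{align}
% so after integrating out mu the x_i are no longer independent: observing one
% data point shifts our beliefs about mu, and hence about the others.
```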

Naive Inference 42 / 46

Exploiting Graph Structure for Efficiency 43 / 46

Prelude to Belief Propagation 44 / 46

Ideas behind Belief Propagation 45 / 46

Next class Up next... Belief Propagation! 46 / 46