Bayesian Machine Learning
Andrew Gordon Wilson
ORIE 6741, Lecture 4: Occam's Razor, Model Construction, and Directed Graphical Models
https://people.orie.cornell.edu/andrew/orie6741
Cornell University, September 1, 2016
References: Bishop (2006), MacKay (2003), Rasmussen and Ghahramani (2001), Ghahramani (2014), Ghahramani (2015), Wilson (2014).
Bayesian Modelling (Theory of Everything)
Regularisation = MAP ≠ Bayesian Inference

Example: Density Estimation
Observations y_1, ..., y_N drawn from an unknown density p(y).
Model: p(y|θ) = w_1 N(y; µ_1, σ_1²) + w_2 N(y; µ_2, σ_2²), θ = {w_1, w_2, µ_1, µ_2, σ_1, σ_2}.
Likelihood: p(y|θ) = ∏_{i=1}^N p(y_i|θ).
Can learn all free parameters θ using maximum likelihood...
Regularisation = MAP ≠ Bayesian Inference

Regularisation or MAP:
Find argmax_θ log p(θ|y) = log p(y|θ) [model fit] + log p(θ) [complexity penalty] + c.
Choose p(θ) such that p(θ) → 0 faster than p(y|θ) → ∞ as σ_1 or σ_2 → 0.

Bayesian Inference:
Predictive distribution: p(y*|y) = ∫ p(y*|θ) p(θ|y) dθ.
Parameter posterior: p(θ|y) ∝ p(y|θ) p(θ).
p(θ) need not be zero anywhere in order to make reasonable inferences.
Can use a sampling scheme, with conjugate posterior updates for each separate mixture component, using an inverse-Gamma prior on the variances σ_1², σ_2².
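A small numerical sketch of why the prior matters here (my own toy data, and an inverse-Gamma prior with hypothetical shape a = 2 and scale b = 1): the mixture likelihood diverges as one component's variance collapses onto a data point, while the log posterior does not.

```python
import numpy as np
from math import lgamma

y = np.linspace(-2.0, 2.0, 9)   # toy observations

def log_lik(y, w1, mu1, s1, mu2, s2):
    """Log-likelihood of a two-component Gaussian mixture."""
    def npdf(x, mu, s):
        return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return np.sum(np.log(w1 * npdf(y, mu1, s1) + (1 - w1) * npdf(y, mu2, s2)))

def log_inv_gamma(v, a=2.0, b=1.0):
    """Log inverse-Gamma prior on a variance v; a, b are illustrative choices."""
    return a * np.log(b) - lgamma(a) - (a + 1) * np.log(v) - b / v

# Centre component 1 on the first data point and shrink its variance:
# the likelihood diverges, but the prior p(sigma_1^2) -> 0 even faster.
ll_collapsed = log_lik(y, 0.5, y[0], 1e-6, 0.0, 1.0)
ll_sensible  = log_lik(y, 0.5, y[0], 1.0,  0.0, 1.0)
lp_collapsed = ll_collapsed + log_inv_gamma(1e-12)
lp_sensible  = ll_sensible  + log_inv_gamma(1.0)
```

Maximum likelihood prefers the collapsed component; MAP with the inverse-Gamma prior does not.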
Model Selection and Marginal Likelihood

p(y|M_1, X) = ∫ p(y|f_1(x, w)) p(w) dw    (1)

[Figure: the marginal likelihood p(y|M) plotted over all possible datasets y, for a simple model, a complex model, and an appropriate model.]
Model Comparison

p(H_1|D) / p(H_2|D) = [p(D|H_1) / p(D|H_2)] × [p(H_1) / p(H_2)].    (2)
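Equation (2) can be made concrete with a classic coin example (my own illustration, not from the slides): H_1 says the coin is fair; H_2 places a uniform prior on the bias. With equal prior odds, the posterior odds reduce to the Bayes factor p(D|H_1)/p(D|H_2).

```python
from math import factorial

def evidence_fair(heads, tails):
    """p(D | H1): every sequence has probability (1/2)^N under a fair coin."""
    return 0.5 ** (heads + tails)

def evidence_uniform(heads, tails):
    """p(D | H2) = integral of theta^h (1-theta)^t dtheta = h! t! / (N+1)!"""
    return factorial(heads) * factorial(tails) / factorial(heads + tails + 1)

# Unremarkable data: the simpler hypothesis wins (Occam's razor).
bf_balanced = evidence_fair(5, 5) / evidence_uniform(5, 5)      # ≈ 2.71
# Lopsided data: the flexible hypothesis wins.
bf_lopsided = evidence_fair(9, 1) / evidence_uniform(9, 1)      # ≈ 0.107
```

The flexible model is only rewarded when the data actually demand it; otherwise its spread-out evidence penalises it automatically.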
Blackboard: Examples of Occam's Razor in Everyday Inferences

For further reading, see MacKay's (2003) textbook, Information Theory, Inference, and Learning Algorithms.
Occam's Razor Example

-1, 3, 7, 11, ??, ??

H_1: the sequence is an arithmetic progression, "add n", where n is an integer.
H_2: the sequence is generated by a cubic function of the form cx³ + dx² + e, where c, d, and e are fractions. (e.g., (-1/11)x³ + (9/11)x² + (23/11))
Model Selection

[Figure: observations y(x) plotted against inputs x.]

Observations y(x). Assume p(y(x)|f(x)) = N(y(x); f(x), σ²). Consider polynomials of different orders. As always, the observations are outside the chosen model class! Which model should we choose?

f_0(x) = a_0,    (3)
f_1(x) = a_0 + a_1 x,    (4)
f_2(x) = a_0 + a_1 x + a_2 x²,    (5)
...    (6)
f_J(x) = a_0 + a_1 x + a_2 x² + ... + a_J x^J.    (7)
Model Selection: Occam's Hill

[Figure: marginal likelihood (evidence) as a function of model order, using an isotropic prior p(a) = N(0, σ² I).]
Model Selection: Occam's Asymptote

[Figure: marginal likelihood (evidence) as a function of model order, using an anisotropic prior p(a_i) = N(0, γ_i), with γ learned from the data.]
Occam's Razor

[Figure: marginal likelihood (evidence) as a function of model order. (a) Isotropic Gaussian prior. (b) Anisotropic Gaussian prior.]

For further reading, see Rasmussen and Ghahramani (2001) (Occam's Razor), Kass and Raftery (1995) (Bayes Factors), and MacKay (2003), Chapter 28.
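The Occam's-hill picture is easy to reproduce, since the evidence of a polynomial model with Gaussian prior and Gaussian noise is available in closed form. Below is a minimal sketch (my own illustrative data: a quadratic plus noise; the prior scale sigma_a = 1 and noise sigma_n = 0.1 are assumptions):

```python
import numpy as np

def log_evidence(x, y, order, sigma_n=0.1, sigma_a=1.0):
    """Log marginal likelihood of polynomial regression with prior
    a ~ N(0, sigma_a^2 I): marginally, y ~ N(0, sigma_a^2 Phi Phi^T + sigma_n^2 I)."""
    Phi = np.vander(x, order + 1, increasing=True)   # columns 1, x, x^2, ...
    K = sigma_a ** 2 * Phi @ Phi.T + sigma_n ** 2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 0.5 - x + 2 * x ** 2 + 0.1 * rng.standard_normal(30)   # true order: 2

evidences = [log_evidence(x, y, J) for J in range(9)]
# The evidence rises sharply up to the true order, then declines gently.
```

With the isotropic prior the curve peaks near the true order and then falls off, as in the Occam's-hill figure.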
Automatic Choice of Dimensionality for PCA

PCA projects a d-dimensional vector x into a k ≤ d dimensional space in a way that maximizes the variance of the projection. How do we choose k?
Probabilistic PCA

Formulate dimensionality reduction as a probabilistic model:

x = Σ_{j=1}^k h_j w_j + m + ε,    (8)
  = Hw + m + ε,    (9)
ε ~ N(0, V).    (10)

Let V = v I_d and p(w) = N(0, I_k).

The maximum likelihood solution for H, given data D = {x_1, ..., x_N}, is exactly equal to the PCA solution!

Let's place probability distributions over H and m, integrate them away from the likelihood, then use the evidence p(D|k) to determine the value of k. As N → ∞, the evidence will collapse onto the true value of k.

Automatically Learning the Dimensionality of PCA (Minka, 2001).
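Minka's Laplace approximation to p(D|k) takes a few more lines than fit here; as a crude stand-in (my substitution, not the slides' method), the sketch below scores each k with BIC applied to Tipping and Bishop's closed-form maximum-likelihood PPCA solution. On data with a clear latent dimensionality, the score peaks at the true k.

```python
import numpy as np

def ppca_bic(X, k):
    """BIC score for probabilistic PCA with latent dimension k (a crude
    large-N stand-in for Minka's Laplace evidence). Uses the eigenvalue
    form of Tipping & Bishop's maximum-likelihood PPCA solution."""
    N, d = X.shape
    S = np.cov(X, rowvar=False, bias=True)
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]
    sigma2 = lam[k:].mean()                       # ML noise variance
    loglik = -0.5 * N * (np.sum(np.log(lam[:k])) + (d - k) * np.log(sigma2)
                         + d * np.log(2 * np.pi) + d)
    m = d * k - k * (k - 1) // 2 + 1              # free parameters in H, sigma^2
    return loglik - 0.5 * m * np.log(N)

rng = np.random.default_rng(0)
N, d, k_true = 500, 10, 3
Z = rng.standard_normal((N, k_true))
W = 2.0 * rng.standard_normal((k_true, d))
X = Z @ W + 0.5 * rng.standard_normal((N, d))
X = X - X.mean(axis=0)

scores = [ppca_bic(X, k) for k in range(1, d)]
best_k = int(np.argmax(scores)) + 1
```

The likelihood alone always improves with k; the complexity penalty is what makes the score peak at the true dimensionality, mirroring the evidence-based selection on this slide.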
Automatically Learning the Dimensionality of PCA

[Figure slides.]
Model Construction: Support and Inductive Biases

Support: which datasets (hypotheses) are a priori possible.
Inductive biases: which datasets are a priori likely.

We want the support of our model to be as large as possible, so that we do not rule out potential explanations of the data, with inductive biases calibrated to particular applications, so that we can quickly learn from a finite amount of information on a particular application.

Examples (discussion and illustrations with respect to the figure on slide 6): human learning and deep learning.
Graphical Models

Open circles correspond to random variables.
Filled circles correspond to observed random variables (whiteboard).
Small closed circles correspond to deterministic variables (whiteboard).
Square boxes show factor decompositions.
Edges represent statistical dependencies between variables.
The whole model represents a joint probability distribution.
Graphical Models (Motivation)

Graphs are an intuitive way of representing and visualising the relationships between many variables. (Examples: family trees, electric circuit diagrams, neural networks.)

A graph allows us to abstract the conditional independencies between variables away from the details of their parametric forms. (Whiteboard.)

We can answer questions like "Is A dependent on B given that we know the value of C?" just by looking at the graph.

Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently. Thus we can answer queries like "What is p(A|C = c)?" without enumerating all settings of all variables in the model.
Independencies

Examples of Conditional Independencies

Group Discussion: Conditional Independence
Directed Graphical Model

The model represents a joint distribution; edges show dependencies.
Example (fully connected graph): p(a, b, c) = p(a|b, c) p(b|c) p(c).
Is this a unique representation of p(a, b, c)?
Directed Graphical Model

For a fully connected graph:
p(x_1, ..., x_K) = p(x_K|x_1, ..., x_{K-1}) ... p(x_2|x_1) p(x_1)    (11)
Sparse Directed Graphical Model

Group discussion: what's the joint distribution?
Joint Distributions

For a graph with K nodes, the joint distribution is given by
p(x) = ∏_{k=1}^K p(x_k|pa_k)    (12)
where pa_k denotes the parents of node x_k.
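A minimal numeric check of equation (12) on a three-node chain a → b → c (the conditional probability tables are hypothetical): building the joint as p(a) p(b|a) p(c|b) yields a properly normalised distribution whose marginals match the specified factors.

```python
import numpy as np

# Hypothetical conditional probability tables for the chain a -> b -> c.
p_a = np.array([0.6, 0.4])
p_b_given_a = np.array([[0.7, 0.3],   # row: value of a, column: value of b
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.9, 0.1],   # row: value of b, column: value of c
                        [0.5, 0.5]])

# Equation (12): joint[a, b, c] = p(a) * p(b|a) * p(c|b)
joint = np.einsum('a,ab,bc->abc', p_a, p_b_given_a, p_c_given_b)
```

Because each factor is normalised over its own variable, the product automatically sums to one over all joint configurations.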
Example: Polynomial Regression

y = w^T φ(x, v) + ε    (13)
ε ~ N(0, σ²)    (14)
w ~ N(0, α² I)    (15)

What's the graphical model defining the joint distribution p(w, y), with y = (y_1, ..., y_N)^T? How do we use this graphical model to infer p(y*|D, α², σ², v)? Group discussion.
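One concrete answer to the inference question, under an assumed basis φ(x) = (1, x, x²) and fixed hyperparameters (all values below are my own illustrative choices): the posterior over w and the predictive distribution are both Gaussian and available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, alpha2 = 0.05 ** 2, 10.0 ** 2        # assumed noise and prior variances
w_true = np.array([0.5, -1.0, 2.0])

x = np.linspace(-1, 1, 50)
Phi = np.vander(x, 3, increasing=True)       # basis phi(x) = (1, x, x^2)
y = Phi @ w_true + np.sqrt(sigma2) * rng.standard_normal(50)

# Posterior over w: N(w; mu, A^{-1}) with A = Phi^T Phi / sigma^2 + I / alpha^2
A = Phi.T @ Phi / sigma2 + np.eye(3) / alpha2
mu = np.linalg.solve(A, Phi.T @ y / sigma2)

# Predictive p(y* | D) at x* = 0.5: mean phi*^T mu,
# variance phi*^T A^{-1} phi* + sigma^2
phi_star = np.array([1.0, 0.5, 0.25])
mean_star = phi_star @ mu
var_star = phi_star @ np.linalg.solve(A, phi_star) + sigma2
```

The predictive variance is the noise variance plus the posterior uncertainty in w projected through the basis, so it is never smaller than σ².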
Conditional Independencies
Conditional Independencies: Tail-Tail

p(a, b) = Σ_c p(a, b, c) = Σ_c p(a|c) p(b|c) p(c) ≠ p(a) p(b) in general.    (16)
Hence a and b are not marginally independent.    (17)
Tail-Tail Observed

Want to see whether p(a, b|c) = p(a|c) p(b|c):
p(a, b|c) = p(a, b, c) / p(c) = p(a|c) p(b|c) p(c) / p(c) = p(a|c) p(b|c).    (18)
Therefore a ⊥ b | c.
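The tail-tail case can be checked numerically on a small discrete example (hypothetical tables) with c → a and c → b:

```python
import numpy as np

p_c = np.array([0.3, 0.7])
p_a_given_c = np.array([[0.9, 0.1],   # row: value of c, column: value of a
                        [0.2, 0.8]])
p_b_given_c = np.array([[0.6, 0.4],
                        [0.1, 0.9]])

# joint[a, b, c] = p(c) p(a|c) p(b|c)
joint = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

p_ab = joint.sum(axis=2)                       # marginalise out c
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
marginally_independent = np.allclose(p_ab, np.outer(p_a, p_b))       # False

p_ab_given_c0 = joint[:, :, 0] / p_c[0]        # condition on c = 0
conditionally_independent = np.allclose(
    p_ab_given_c0, np.outer(p_a_given_c[0], p_b_given_c[0]))         # True
```

Summing out the shared parent c couples a and b; conditioning on c decouples them, exactly as in equations (16)-(18).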
Tail-Head

p(a, b) = Σ_c p(a, b, c) = Σ_c p(a) p(c|a) p(b|c)    (19)
= p(a) p(b|a) ≠ p(a) p(b) in general.    (20)
a and b are not marginally independent.
Tail-Head Observed

Want to see whether p(a, b|c) = p(a|c) p(b|c):
p(a, b|c) = p(a, b, c) / p(c) = p(a) p(c|a) p(b|c) / p(c)    (21)
= [p(a) p(c|a) / p(c)] p(b|c) = p(a|c) p(b|c).    (22)
Therefore a ⊥ b | c.
Head-Head

p(a, b) = Σ_c p(a, b, c) = Σ_c p(a) p(b) p(c|a, b) = p(a) p(b).    (23)
a is marginally independent of b.
Head-Head Observed

p(a, b|c) = p(a, b, c) / p(c) = p(a) p(b) p(c|a, b) / p(c) ≠ p(a|c) p(b|c) in general.    (24)
So a and b are, in general, dependent given c.
In all other cases, observing c blocked the dependence between a and b. Here, however, observing c creates a dependence! This phenomenon is called explaining away (think back to the sprinkler, rain, wet-ground example).
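The sprinkler/rain/wet-ground example makes explaining away concrete (the probabilities below are hypothetical): sprinkler and rain are marginally independent, but once we observe wet ground, learning that the sprinkler was on lowers the probability of rain.

```python
import numpy as np

p_s = np.array([0.7, 0.3])                 # sprinkler off / on
p_r = np.array([0.8, 0.2])                 # no rain / rain
p_w_given_sr = np.array([[[0.98, 0.02],    # p(w | s, r), indexed [s][r][w]
                          [0.10, 0.90]],
                         [[0.10, 0.90],
                          [0.01, 0.99]]])

# joint[s, r, w] = p(s) p(r) p(w | s, r)  -- head-head at w
joint = p_s[:, None, None] * p_r[None, :, None] * p_w_given_sr

# Condition on wet ground (w = 1):
p_sr_given_wet = joint[:, :, 1] / joint[:, :, 1].sum()
p_rain_given_wet = p_sr_given_wet[:, 1].sum()                        # ≈ 0.449
p_rain_given_wet_and_sprinkler = (
    p_sr_given_wet[1, 1] / p_sr_given_wet[1, :].sum())               # ≈ 0.216
```

Observing the common child w makes its two parents compete to explain it, which is exactly the dependence created in equation (24).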
D-separation

Semantics: X ⊥ Y | V if V d-separates X from Y.

Definition: V d-separates X from Y if every undirected path from X to Y is blocked by V. A path is blocked by V if there is a node W on the path such that either:
1. W has converging arrows along the path (→ W ←) (head-head) and neither W nor its descendants are observed (W ∉ V), or
2. W does not have converging arrows along the path (→ W → or ← W →) (tail-head or tail-tail) and W is observed (W ∈ V).

Corollary: the Markov blanket of a node x_i is {parents ∪ children ∪ parents of children}. x_i is independent of everything else conditioned on this blanket.
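The Markov-blanket corollary is easy to turn into code. A sketch (the graph is represented as a parent map; the node names are my own, echoing the sprinkler example):

```python
def markov_blanket(node, parents):
    """Markov blanket = parents, children, and parents of children.
    `parents` maps each node name to the set of its parents."""
    children = {v for v, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]} - {node}
    return parents[node] | children | co_parents

# Hypothetical DAG: cloudy -> sprinkler, cloudy -> rain, {sprinkler, rain} -> wet
parents = {
    'cloudy': set(),
    'sprinkler': {'cloudy'},
    'rain': {'cloudy'},
    'wet': {'sprinkler', 'rain'},
}
```

Note that 'rain' enters the blanket of 'sprinkler' only as a co-parent of their shared child 'wet', reflecting the explaining-away dependence.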
D-separation Examples

Is a ⊥ b | c? Is a ⊥ b | f?
How do deterministic parameters (denoted by small black circles), such as the noise variance σ² in our Bayesian basis regression model, behave with respect to d-separation?
Data Sampled from a Gaussian Distribution

If we condition on the mean µ, the data x_i are independent. But what if we look at the marginal distribution, having integrated away µ?
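A quick Monte Carlo answer (with an assumed prior variance τ² = 1 on µ and unit observation noise): conditioned on µ the points are uncorrelated, but marginally they share µ and become positively correlated, with Cov(x_i, x_j) = τ².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
tau = 1.0                                        # prior std of mu (assumed)

# Marginal: draw a fresh mu for each pair, then two observations sharing it.
mu = tau * rng.standard_normal(n)
x1 = mu + rng.standard_normal(n)
x2 = mu + rng.standard_normal(n)
marginal_cov = np.cov(x1, x2)[0, 1]              # close to tau^2 = 1

# Conditional: fix a single mu; the observations are now independent.
x1c = 1.5 + rng.standard_normal(n)
x2c = 1.5 + rng.standard_normal(n)
conditional_cov = np.cov(x1c, x2c)[0, 1]         # close to 0
```

So integrating away the shared parent µ makes the data exchangeable but dependent, just as d-separation predicts for a common unobserved parent.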
Naive Inference

Exploiting Graph Structure for Efficiency

Prelude to Belief Propagation

Ideas behind Belief Propagation
Next Class

Up next... Belief Propagation!