Bayesian Machine Learning
Andrew Gordon Wilson
ORIE 6741 Lecture 3: Stochastic Gradients, Bayesian Inference, and Occam's Razor
https://people.orie.cornell.edu/andrew/orie6741
Cornell University, August 30, 2016
Bayesian Modelling (Theory of Everything)
[Figure: slide from Ghahramani (2015).]
Worked Example: Basis Regression (Chalkboard)
We have data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ and use the model
$$y = \mathbf{w}^\top \phi(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2).$$
We want to make predictions of $y_*$ for any $x_*$. We will consider topics such as regularization, cross-validation, Bayesian model averaging, and conjugate priors.
Bayesian Linear Basis Regression Results
We have data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ and use the model
$$y = \mathbf{w}^\top \phi(\mathbf{x}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2), \quad p(\mathbf{w} \mid \alpha^2) = \mathcal{N}(\mathbf{w}; 0, \alpha^2 I).$$
Inference
$$p(\mathbf{w} \mid \mathbf{y}, X, \alpha^2) \propto p(\mathbf{y} \mid \mathbf{w}, X)\, p(\mathbf{w} \mid \alpha^2)$$
$$p(\mathbf{w} \mid \mathbf{y}, X) = \mathcal{N}(\mathbf{w}; \mathbf{m}_N, S_N), \quad \mathbf{m}_N = \frac{1}{\sigma^2} S_N \Phi^\top \mathbf{y}, \quad S_N^{-1} = \frac{1}{\alpha^2} I + \frac{1}{\sigma^2} \Phi^\top \Phi, \quad \Phi_{ij} = \phi_j(x_i).$$
Predictions
$$p(y_* \mid x_*, \mathcal{D}, \alpha^2, \sigma^2) = \int p(y_* \mid \mathbf{w}, x_*, \sigma^2)\, p(\mathbf{w} \mid \mathcal{D}, \alpha^2)\, d\mathbf{w} = \mathcal{N}\big(\mathbf{m}_N^\top \phi(x_*),\ \sigma_N^2(x_*)\big),$$
$$\sigma_N^2(x_*) = \sigma^2 + \phi(x_*)^\top S_N\, \phi(x_*).$$
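As a concrete illustration, here is a minimal NumPy sketch of these posterior and predictive updates. The RBF basis, toy data, hyperparameter values, and function names are illustrative assumptions, not from the lecture.

```python
import numpy as np

def posterior(Phi, y, alpha2, sigma2):
    """Posterior N(m_N, S_N) over weights for y = Phi w + noise."""
    M = Phi.shape[1]
    S_N_inv = np.eye(M) / alpha2 + Phi.T @ Phi / sigma2   # S_N^{-1} = I/alpha^2 + Phi^T Phi / sigma^2
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ Phi.T @ y / sigma2                        # m_N = S_N Phi^T y / sigma^2
    return m_N, S_N

def predict(phi_star, m_N, S_N, sigma2):
    """Predictive mean and variance at a new input with features phi_star."""
    mean = m_N @ phi_star
    var = sigma2 + phi_star @ S_N @ phi_star              # sigma^2 + phi(x*)^T S_N phi(x*)
    return mean, var

# Toy example with RBF basis functions (an assumed choice of phi).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
centers = np.linspace(0, 1, 10)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / 0.01)

m_N, S_N = posterior(Phi, y, alpha2=1.0, sigma2=0.01)
phi_star = np.exp(-(0.5 - centers) ** 2 / 0.01)
print(predict(phi_star, m_N, S_N, sigma2=0.01))
```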
Bayesian Linear Basis Regression Results
Learning
$$p(\mathbf{y} \mid \alpha^2, \sigma^2) = \int p(\mathbf{y} \mid \mathbf{w}, \sigma^2)\, p(\mathbf{w} \mid \alpha^2)\, d\mathbf{w}$$
$$\log p(\mathbf{y} \mid \alpha^2, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{M}{2}\log \alpha^2 - \frac{N}{2}\log \sigma^2 - \frac{1}{2}\log|A| - E(\mathbf{m}_N),$$
$$E(\mathbf{m}_N) = \frac{1}{2\sigma^2}\|\mathbf{y} - \Phi \mathbf{m}_N\|^2 + \frac{1}{2\alpha^2}\mathbf{m}_N^\top \mathbf{m}_N,$$
$$A = \frac{1}{\alpha^2} I + \frac{1}{\sigma^2}\Phi^\top\Phi = \nabla\nabla E(\mathbf{w}), \qquad \mathbf{m}_N = \frac{1}{\sigma^2} A^{-1}\Phi^\top \mathbf{y}.$$
Procedure: learn $\alpha^2$ and $\sigma^2$ through marginal likelihood optimization. Then condition on these learned parameters to form the predictive distribution $p(y_* \mid x_*, \mathcal{D}, \hat{\alpha}^2, \hat{\sigma}^2)$.
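A sketch of the corresponding evidence computation and hyperparameter learning. The optimizer choice, log-parametrization, and function names are assumptions; the formula is the one above.

```python
import numpy as np
from scipy.optimize import minimize

def log_evidence(Phi, y, alpha2, sigma2):
    """log p(y | alpha^2, sigma^2) for the linear basis model."""
    N, M = Phi.shape
    A = np.eye(M) / alpha2 + Phi.T @ Phi / sigma2
    m_N = np.linalg.solve(A, Phi.T @ y) / sigma2
    E = np.sum((y - Phi @ m_N) ** 2) / (2 * sigma2) + m_N @ m_N / (2 * alpha2)
    _, logdetA = np.linalg.slogdet(A)
    return (-N / 2 * np.log(2 * np.pi) - M / 2 * np.log(alpha2)
            - N / 2 * np.log(sigma2) - 0.5 * logdetA - E)

def learn_hypers(Phi, y):
    """Maximize the evidence over (alpha^2, sigma^2), optimizing in log space."""
    obj = lambda p: -log_evidence(Phi, y, np.exp(p[0]), np.exp(p[1]))
    res = minimize(obj, x0=np.zeros(2), method="L-BFGS-B")
    return np.exp(res.x)  # (alpha2_hat, sigma2_hat)

# Example usage, with the Phi and y constructed in the previous sketch:
# alpha2_hat, sigma2_hat = learn_hypers(Phi, y)
```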
Rant: Regularisation = MAP ≠ Bayesian Inference
Example: Density Estimation
Observations $y_1, \dots, y_N$ drawn from an unknown density $p(y)$.
Model: $p(y \mid \theta) = w_1 \mathcal{N}(y; \mu_1, \sigma_1^2) + w_2 \mathcal{N}(y; \mu_2, \sigma_2^2)$, with $\theta = \{w_1, w_2, \mu_1, \mu_2, \sigma_1, \sigma_2\}$.
Likelihood: $p(\mathbf{y} \mid \theta) = \prod_{i=1}^N p(y_i \mid \theta)$.
Can learn all free parameters $\theta$ using maximum likelihood... (but the likelihood is unbounded: it diverges as $\sigma_1 \to 0$ with $\mu_1$ placed on a data point).
Regularisation = MAP ≠ Bayesian Inference
Regularisation or MAP: find
$$\operatorname*{argmax}_{\theta} \log p(\theta \mid \mathbf{y}) = \underbrace{\log p(\mathbf{y} \mid \theta)}_{\text{model fit}} + \underbrace{\log p(\theta)}_{\text{complexity penalty}} + \text{const.}$$
Choose $p(\theta)$ such that $p(\theta) \to 0$ faster than $p(\mathbf{y} \mid \theta) \to \infty$ as $\sigma_1$ or $\sigma_2 \to 0$.
Bayesian Inference
Predictive distribution: $p(y_* \mid \mathbf{y}) = \int p(y_* \mid \theta)\, p(\theta \mid \mathbf{y})\, d\theta$.
Parameter posterior: $p(\theta \mid \mathbf{y}) \propto p(\mathbf{y} \mid \theta)\, p(\theta)$.
$p(\theta)$ need not be zero anywhere in order to make reasonable inferences.
Can use a sampling scheme, with conjugate posterior updates for each separate mixture component, using an inverse-Gamma prior on the variances $\sigma_1^2, \sigma_2^2$.
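A small sketch contrasting the unpenalized likelihood with a MAP objective for this two-component mixture. The particular data, inverse-Gamma hyperparameters, and parameter packing below are hypothetical, chosen only to show the variance-collapse behaviour.

```python
import numpy as np
from scipy.stats import norm, invgamma

def log_lik(y, theta):
    """Two-component Gaussian mixture log-likelihood."""
    w1, mu1, mu2, s1, s2 = theta
    p = w1 * norm.pdf(y, mu1, s1) + (1 - w1) * norm.pdf(y, mu2, s2)
    return np.sum(np.log(p))

def log_map_objective(y, theta, a=2.0, b=0.5):
    """log p(y|theta) + log p(theta): inverse-Gamma(a, b) priors on the variances
    go to zero fast enough to cancel the likelihood spike as s1 or s2 -> 0."""
    _, _, _, s1, s2 = theta
    log_prior = invgamma.logpdf(s1**2, a, scale=b) + invgamma.logpdf(s2**2, a, scale=b)
    return log_lik(y, theta) + log_prior

# Hypothetical data: one isolated point at 0 and a cluster near 2.
y = np.array([0.0, 2.0, 2.1, 1.9, 2.05])
# The likelihood keeps growing as s1 -> 0 with mu1 on a data point; the MAP objective does not.
for s1 in [0.5, 0.05, 0.005]:
    theta = (0.5, 0.0, 2.0, s1, 0.3)
    print(s1, log_lik(y, theta), log_map_objective(y, theta))
```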
Learning with Stochastic Gradient Descent (Chalkboard)
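Since the derivation itself is on the chalkboard, here is only a minimal sketch of minibatch SGD applied to the MAP objective of the basis-regression model above. The learning rate, batch size, and function name are assumptions.

```python
import numpy as np

def sgd_linear_basis(Phi, y, alpha2=1.0, sigma2=0.1, lr=0.01, epochs=200, batch=8, seed=0):
    """Minibatch SGD on the averaged MAP objective
    (1/N) [ ||y - Phi w||^2 / (2 sigma^2) + ||w||^2 / (2 alpha^2) ].
    Each minibatch gradient is an unbiased estimate of the full gradient."""
    rng = np.random.default_rng(seed)
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), max(N // batch, 1)):
            Phi_b, y_b = Phi[idx], y[idx]
            grad = (Phi_b.T @ (Phi_b @ w - y_b) / (len(idx) * sigma2)  # data term, subsampled
                    + w / (N * alpha2))                                # prior term, exact
            w -= lr * grad
    return w
```

A fixed learning rate is used here for simplicity; a decaying schedule would be needed for exact convergence.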
Model Selection and Marginal Likelihood
$$p(\mathbf{y} \mid \mathcal{M}_1, X) = \int p(\mathbf{y} \mid f_1(\mathbf{x}, \mathbf{w}))\, p(\mathbf{w})\, d\mathbf{w}$$
[Figure: the marginal likelihood $p(\mathbf{y} \mid \mathcal{M})$ plotted over all possible datasets $\mathbf{y}$, for a complex model, a simple model, and an appropriate model.]
Model Comparison
$$\frac{p(\mathcal{H}_1 \mid \mathcal{D})}{p(\mathcal{H}_2 \mid \mathcal{D})} = \frac{p(\mathcal{D} \mid \mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_2)}\,\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_2)}.$$
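For intuition, a worked example with hypothetical numbers:
$$\frac{p(\mathcal{H}_1 \mid \mathcal{D})}{p(\mathcal{H}_2 \mid \mathcal{D})}
= \underbrace{\frac{p(\mathcal{D} \mid \mathcal{H}_1)}{p(\mathcal{D} \mid \mathcal{H}_2)}}_{\text{Bayes factor, say } 20}
\times \underbrace{\frac{p(\mathcal{H}_1)}{p(\mathcal{H}_2)}}_{\text{prior odds, say } 1/4} = 5,$$
so if $\mathcal{H}_1$ and $\mathcal{H}_2$ are the only hypotheses under consideration, $p(\mathcal{H}_1 \mid \mathcal{D}) = 5/6 \approx 0.83$: the data favour $\mathcal{H}_1$ even though it was the prior underdog.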
Blackboard: Examples of Occam's Razor in Everyday Inferences
For further reading, see the MacKay (2003) textbook, Information Theory, Inference, and Learning Algorithms.
Occam's Razor Example
$-1,\ 3,\ 7,\ 11,\ ??,\ ??$
$\mathcal{H}_1$: the sequence is an arithmetic progression, "add $n$", where $n$ is an integer.
$\mathcal{H}_2$: each term is generated from the previous term $x$ by a cubic function of the form $cx^3 + dx^2 + e$, where $c$, $d$, and $e$ are fractions; here
$$-\tfrac{1}{11}x^3 + \tfrac{9}{11}x^2 + \tfrac{23}{11}.$$
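A quick check that the cubic, applied to each previous term, does reproduce the observed sequence:
$$-\tfrac{(-1)^3}{11} + \tfrac{9(-1)^2}{11} + \tfrac{23}{11} = \tfrac{1 + 9 + 23}{11} = 3, \qquad
-\tfrac{3^3}{11} + \tfrac{9\cdot 3^2}{11} + \tfrac{23}{11} = \tfrac{-27 + 81 + 23}{11} = 7, \qquad
-\tfrac{7^3}{11} + \tfrac{9\cdot 7^2}{11} + \tfrac{23}{11} = \tfrac{-343 + 441 + 23}{11} = 11.$$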
Model Selection
[Figure: observations $y(x)$ plotted against inputs $x \in [0, 100]$.]
Observations $y(x)$. Assume $p(y(x) \mid f(x)) = \mathcal{N}(y(x); f(x), \sigma^2)$, and consider polynomials of different orders:
$$f_0(x) = a_0, \quad f_1(x) = a_0 + a_1 x, \quad f_2(x) = a_0 + a_1 x + a_2 x^2, \quad \dots, \quad f_J(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_J x^J.$$
As always, the observations are outside the chosen model class! Which model should we choose?
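A sketch of how this comparison could be carried out numerically, reusing the evidence formula from the learning slide. The data, prior variance, and noise level below are hypothetical.

```python
import numpy as np

def log_evidence_poly(x, y, order, prior_var=1.0, noise_var=0.1):
    """log p(y | model order) for polynomial regression with an isotropic
    Gaussian prior on the coefficients."""
    Phi = np.vander(x, order + 1, increasing=True)     # columns 1, x, x^2, ..., x^order
    N, M = Phi.shape
    A = np.eye(M) / prior_var + Phi.T @ Phi / noise_var
    m_N = np.linalg.solve(A, Phi.T @ y) / noise_var
    E = np.sum((y - Phi @ m_N) ** 2) / (2 * noise_var) + m_N @ m_N / (2 * prior_var)
    _, logdetA = np.linalg.slogdet(A)
    return (-N / 2 * np.log(2 * np.pi) - M / 2 * np.log(prior_var)
            - N / 2 * np.log(noise_var) - 0.5 * logdetA - E)

# Hypothetical data, roughly quadratic in x; the evidence typically peaks near order 2.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = 0.5 - x + 0.8 * x**2 + 0.3 * rng.standard_normal(40)
for J in range(7):
    print(J, log_evidence_poly(x, y, J))
```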
Model Selection: Occam's Hill
[Figure: marginal likelihood (evidence) as a function of model order, using an isotropic prior $p(\mathbf{a}) = \mathcal{N}(0, \sigma^2 I)$.]
Model Selection: Occam's Asymptote
[Figure: marginal likelihood (evidence) as a function of model order, using an anisotropic prior $p(a_i) = \mathcal{N}(0, \gamma_i)$, with $\gamma$ learned from the data.]
Occam's Razor
[Figure: marginal likelihood (evidence) vs. model order for (a) the isotropic Gaussian prior and (b) the anisotropic Gaussian prior.]
For further reading, see Rasmussen and Ghahramani (2001) (Occam's Razor), Kass and Raftery (1995) (Bayes Factors), and MacKay (2003), Chapter 28.
Automatic Choice of Dimensionality for PCA
PCA projects a $d$-dimensional vector $\mathbf{x}$ into a $k \leq d$ dimensional space in a way that maximizes the variance of the projection. How do we choose $k$?
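A minimal sketch of the PCA projection itself, via the SVD of the centred data; the function name is an assumption.

```python
import numpy as np

def pca_project(X, k):
    """Project the d-dimensional rows of X onto the k directions of maximum variance."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # top-k principal directions
    return Xc @ components.T                       # N x k projection
```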
Probabilistic PCA
Formulate dimensionality reduction as a probabilistic model:
$$\mathbf{x} = \sum_{j=1}^{k} \mathbf{h}_j w_j + \mathbf{m} + \boldsymbol{\epsilon} = H\mathbf{w} + \mathbf{m} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, V).$$
Let $V = v I_d$ and $p(\mathbf{w}) = \mathcal{N}(0, I_k)$.
The maximum likelihood solution for $H$, given data $\mathcal{D} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, is exactly equal to the PCA solution!
Let's place probability distributions over $H$ and $\mathbf{m}$, integrate them away from the likelihood, then use the evidence $p(\mathcal{D} \mid k)$ to determine the value of $k$. As $N \to \infty$, the evidence will collapse onto the true value of $k$.
Automatically Learning the Dimensionality of PCA (Minka, 2001).
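scikit-learn exposes Minka's evidence-based choice of $k$ through PCA(n_components='mle'). A small sketch with simulated data; the latent dimensionality, mixing matrix, and noise level are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 true latent dimensions embedded in 20 observed dimensions.
rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 5))
H = rng.standard_normal((5, 20))
X = latent @ H + 0.1 * rng.standard_normal((500, 20))

# n_components='mle' selects k with Minka's Bayesian evidence approximation.
pca = PCA(n_components="mle").fit(X)
print(pca.n_components_)   # expected to recover a value near 5
```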
Automatically Learning the Dimensionality of PCA
[Figure slides: results from Minka (2001) on selecting the PCA dimensionality via the evidence $p(\mathcal{D} \mid k)$.]