Bayesian Machine Learning


Bayesian Machine Learning. Andrew Gordon Wilson. ORIE 6741 Lecture 3: Stochastic Gradients, Bayesian Inference, and Occam's Razor. https://people.orie.cornell.edu/andrew/orie6741. Cornell University, August 30, 2016. 1 / 23

Bayesian Modelling (Theory of Everything). Slide from Ghahramani (2015). 2 / 23

Worked Example: Basis Regression (Chalkboard). We have data D = {(x_i, y_i)}, i = 1, ..., N, and use the model
y = wᵀφ(x) + ɛ, ɛ ~ N(0, σ²).
We want to make predictions of y* for any x*. We will consider topics such as regularization, cross-validation, Bayesian model averaging and conjugate priors. 3 / 23

Bayesian Linear Basis Regression Results. We have data D = {(x_i, y_i)}, i = 1, ..., N, and use the model
y = wᵀφ(x) + ɛ, ɛ ~ N(0, σ²), with prior p(w | α²) = N(w; 0, α²I).
Inference:
p(w | y, X, α²) ∝ p(y | w, X) p(w | α²)
p(w | y, X, α²) = N(w; m_N, S_N)
m_N = (1/σ²) S_N Φᵀ y
S_N⁻¹ = (1/α²) I + (1/σ²) Φᵀ Φ, where Φ_ij = φ_j(x_i).
Predictions:
p(y* | x*, D, α², σ²) = ∫ p(y* | w, x*, σ²) p(w | D, α²) dw = N(m_Nᵀ φ(x*), σ_N²(x*))
σ_N²(x*) = σ² + φ(x*)ᵀ S_N φ(x*). 4 / 23
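A minimal NumPy sketch of the posterior and predictive formulas above. The data, the basis functions (a quadratic polynomial basis), and the hyperparameter values are invented for illustration, not taken from the lecture.

```python
import numpy as np

def posterior(Phi, y, alpha2, sigma2):
    """Return m_N, S_N for prior p(w) = N(0, alpha2 I) and noise variance sigma2."""
    m = Phi.shape[1]
    S_N_inv = np.eye(m) / alpha2 + Phi.T @ Phi / sigma2
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ Phi.T @ y / sigma2
    return m_N, S_N

def predict(phi_star, m_N, S_N, sigma2):
    """Predictive mean and variance at a test point with features phi_star."""
    mean = m_N @ phi_star
    var = sigma2 + phi_star @ S_N @ phi_star
    return mean, var

# Toy example with an assumed basis phi(x) = [1, x, x^2].
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(20)
Phi = np.vander(X, 3, increasing=True)          # Phi_ij = phi_j(x_i)
m_N, S_N = posterior(Phi, y, alpha2=1.0, sigma2=0.01)
mean, var = predict(np.array([1.0, 0.5, 0.25]), m_N, S_N, sigma2=0.01)
```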

Bayesian Linear Basis Regression Results. Learning (with n data points and m basis functions):
p(y | α², σ²) = ∫ p(y | w, σ²) p(w | α²) dw
log p(y | α², σ²) = −(n/2) log(2π) − (m/2) log α² − (n/2) log σ² − (1/2) log |A| − E(m_N),
E(m_N) = (1/(2σ²)) ||y − Φ m_N||² + (1/(2α²)) m_Nᵀ m_N,
A = (1/α²) I + (1/σ²) Φᵀ Φ = ∇∇E(w),
m_N = (1/σ²) A⁻¹ Φᵀ y.
Procedure: learn α² and σ² through marginal likelihood optimization, then condition on these learned parameters to form the predictive distribution p(y* | x*, D, α̂², σ̂²). 5 / 23
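A sketch of this learning procedure: evaluate the log marginal likelihood above and maximize it over (α², σ²) with a generic optimizer. The toy data and basis are the same invented ones as in the previous sketch.

```python
import numpy as np
from scipy.optimize import minimize

def log_evidence(Phi, y, alpha2, sigma2):
    """log p(y | alpha^2, sigma^2) as on this slide (n points, m basis functions)."""
    n, m = Phi.shape
    A = np.eye(m) / alpha2 + Phi.T @ Phi / sigma2
    m_N = np.linalg.solve(A, Phi.T @ y) / sigma2
    E = np.sum((y - Phi @ m_N) ** 2) / (2 * sigma2) + m_N @ m_N / (2 * alpha2)
    _, logdetA = np.linalg.slogdet(A)
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * m * np.log(alpha2)
            - 0.5 * n * np.log(sigma2) - 0.5 * logdetA - E)

# Toy data and basis (assumed, as in the previous sketch).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 20)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(20)
Phi = np.vander(X, 3, increasing=True)

# Optimize over log alpha^2 and log sigma^2 so both variances stay positive.
neg = lambda lp: -log_evidence(Phi, y, np.exp(lp[0]), np.exp(lp[1]))
res = minimize(neg, x0=np.log([1.0, 0.1]))
alpha2_hat, sigma2_hat = np.exp(res.x)
```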

Rant: Regularisation = MAP ≠ Bayesian Inference. Example: Density Estimation. Observations y_1, ..., y_N drawn from an unknown density p(y). Model:
p(y | θ) = w₁ N(y; µ₁, σ₁²) + w₂ N(y; µ₂, σ₂²), θ = {w₁, w₂, µ₁, µ₂, σ₁, σ₂}.
Likelihood: p(y | θ) = ∏_{i=1}^N p(y_i | θ).
Can learn all free parameters θ using maximum likelihood... but the likelihood is unbounded: centring one component on a single data point and letting its variance shrink to zero drives p(y | θ) to infinity. 6 / 23

Regularisation = MAP ≠ Bayesian Inference.
Regularisation or MAP: find
argmax_θ log p(θ | y) = argmax_θ [ log p(y | θ) + log p(θ) ],
where log p(y | θ) is the model fit and log p(θ) acts as a complexity penalty. Choose p(θ) such that p(θ) → 0 faster than p(y | θ) → ∞ as σ₁ or σ₂ → 0.
Bayesian Inference:
Predictive distribution: p(y* | y) = ∫ p(y* | θ) p(θ | y) dθ.
Parameter posterior: p(θ | y) ∝ p(y | θ) p(θ).
p(θ) need not be zero anywhere in order to make reasonable inferences. Can use a sampling scheme, with conjugate posterior updates for each separate mixture component, using an inverse Gamma prior on the variances σ₁², σ₂². 7 / 23
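One concrete way to realize the sampling scheme mentioned above is a Gibbs sampler with conjugate updates for each component. The slide names only the inverse-Gamma prior on the variances; the Normal priors on the means, the Dirichlet prior on the weights, the hyperparameter values, and the toy data below are assumptions added for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data from two well-separated components.
y = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 1.0, 100)])
N, K = len(y), 2

# Assumed priors: mu_j ~ N(mu0, tau2), sigma_j^2 ~ InvGamma(a0, b0), (w1, w2) ~ Dirichlet(1, 1).
mu0, tau2, a0, b0 = 0.0, 10.0, 2.0, 1.0

# Initial parameter values.
mu = np.array([-1.0, 1.0])
sigma2 = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

for it in range(2000):
    # 1. Sample component assignments z_i given the current parameters.
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * (y[:, None] - mu) ** 2 / sigma2)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=row) for row in p])

    for j in range(K):
        yj = y[z == j]
        nj = len(yj)
        # 2. Conjugate Normal update for mu_j given sigma_j^2.
        var_j = 1.0 / (1.0 / tau2 + nj / sigma2[j])
        mean_j = var_j * (mu0 / tau2 + yj.sum() / sigma2[j])
        mu[j] = rng.normal(mean_j, np.sqrt(var_j))
        # 3. Conjugate inverse-Gamma update for sigma_j^2 given mu_j.
        a_n = a0 + 0.5 * nj
        b_n = b0 + 0.5 * np.sum((yj - mu[j]) ** 2)
        sigma2[j] = 1.0 / rng.gamma(a_n, 1.0 / b_n)
    # 4. Conjugate Dirichlet update for the mixture weights.
    w = rng.dirichlet(1.0 + np.bincount(z, minlength=K))

print(mu, np.sqrt(sigma2), w)  # posterior draws should sit near the generating values
```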

Learning with Stochastic Gradient Descent (Chalkboard). 8 / 23
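The chalkboard material is not transcribed; as a stand-in, here is a minimal sketch of stochastic gradient descent applied to the MAP objective of the basis regression model from the earlier slides. The basis, step size, minibatch size, and data are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 500)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(500)        # toy data
Phi = np.vander(X, 3, increasing=True)                    # assumed basis: [1, x, x^2]
sigma2, alpha2 = 0.01, 1.0                                # assumed noise and prior variances

w = np.zeros(Phi.shape[1])
lr, batch = 0.005, 32                                     # assumed step size and batch size
for step in range(2000):
    idx = rng.choice(len(y), size=batch, replace=False)
    Pb, yb = Phi[idx], y[idx]
    # Minibatch estimate of the gradient of the averaged negative log posterior:
    #   L(w)/N = (1/N) [ ||y - Phi w||^2 / (2 sigma^2) + ||w||^2 / (2 alpha^2) ]
    grad = Pb.T @ (Pb @ w - yb) / (batch * sigma2) + w / (len(y) * alpha2)
    w -= lr * grad

print(w)  # should approach the MAP solution / posterior mean m_N from the earlier slide
```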

Model Selection and Marginal Likelihood.
p(y | M₁, X) = ∫ p(y | f₁(x, w)) p(w) dw.
[Figure: the marginal likelihood p(y | M) plotted over all possible datasets y, for a simple model, a complex model, and an appropriate model.] 9 / 23

Model Comparison.
p(H₁ | D) / p(H₂ | D) = [p(D | H₁) / p(D | H₂)] × [p(H₁) / p(H₂)]. 10 / 23
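A small numeric illustration of this formula, with hypothetical marginal likelihoods and equal prior probabilities for the two hypotheses.

```python
# Hypothetical marginal likelihoods and equal prior probabilities.
evidence = {"H1": 2.0e-4, "H2": 2.0e-5}    # p(D | H_i), made-up values
prior = {"H1": 0.5, "H2": 0.5}

bayes_factor = evidence["H1"] / evidence["H2"]              # 10.0
posterior_odds = bayes_factor * prior["H1"] / prior["H2"]   # 10.0
p_H1 = posterior_odds / (1.0 + posterior_odds)
print(f"Bayes factor = {bayes_factor:.0f}, p(H1 | D) = {p_H1:.2f}")   # ~0.91
```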

Blackboard: Examples of Occam's Razor in Everyday Inferences. For further reading, see the MacKay (2003) textbook, Information Theory, Inference, and Learning Algorithms. 11 / 23

Occam's Razor Example: −1, 3, 7, 11, ??, ??
H₁: the sequence is an arithmetic progression, "add n", where n is an integer.
H₂: the sequence is generated by a cubic rule of the form cx³ + dx² + e, where c, d, and e are fractions (here −(1/11)x³ + (9/11)x² + 23/11, which maps each term to the next). 12 / 23
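A quick check that both hypotheses reproduce the observed terms, and what each predicts next; the cubic rule maps each term to its successor.

```python
from fractions import Fraction

seq = [-1, 3, 7, 11]

# H1 (fit to this data with n = 4): add n to the previous term.
assert all(b - a == 4 for a, b in zip(seq, seq[1:]))

# H2: each term maps to the next under the cubic rule f(x) = -x^3/11 + 9x^2/11 + 23/11.
f = lambda x: Fraction(-x**3 + 9 * x**2 + 23, 11)
assert all(f(a) == b for a, b in zip(seq, seq[1:]))

# Both hypotheses explain the data; they disagree about what comes next.
print([seq[-1] + 4, seq[-1] + 8])          # H1 predicts 15, 19
print([f(11), f(f(11))])                   # H2 predicts -219/11, then f(-219/11)
```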

Model Selection.
[Figure: noisy observations y(x) (outputs) plotted against inputs x.]
Observations y(x). Assume p(y(x) | f(x)) = N(y(x); f(x), σ²). Consider polynomials of different orders; as always, the observations are outside the chosen model class! Which model should we choose?
f₀(x) = a₀
f₁(x) = a₀ + a₁x
f₂(x) = a₀ + a₁x + a₂x²
...
f_J(x) = a₀ + a₁x + a₂x² + ⋯ + a_J x^J. 13 / 23

Model Selection: Occam's Hill.
[Figure: marginal likelihood (evidence) as a function of model order, using an isotropic prior p(a) = N(0, σ²I).] 14 / 23
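A sketch in the spirit of this figure: score polynomial models of increasing order by their log evidence under an isotropic Gaussian prior on the coefficients, using the equivalent marginal form y ~ N(0, σ²I + α²ΦΦᵀ). The data-generating curve, prior scale, and noise level are invented for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(Phi, y, prior_var, noise_var):
    # Marginalizing w under p(w) = N(0, prior_var I) gives y ~ N(0, noise_var I + prior_var Phi Phi^T).
    C = noise_var * np.eye(len(y)) + prior_var * Phi @ Phi.T
    return multivariate_normal(mean=np.zeros(len(y)), cov=C).logpdf(y)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 60)
y = 1.0 - 0.5 * x + 2.0 * (x - 0.5) ** 3 + 0.05 * rng.standard_normal(60)   # toy cubic data

for J in range(11):
    Phi = np.vander(x, J + 1, increasing=True)    # f_J(x) = a_0 + a_1 x + ... + a_J x^J
    print(J, round(log_evidence(Phi, y, prior_var=1.0, noise_var=0.05 ** 2), 1))
```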

Model Selection: Occam's Asymptote.
[Figure: marginal likelihood (evidence) as a function of model order, using an anisotropic prior p(a_i) = N(0, γ_i), with γ learned from the data.] 15 / 23

Occam's Razor.
[Figure, two panels: marginal likelihood (evidence) as a function of model order; (a) isotropic Gaussian prior, (b) anisotropic Gaussian prior.]
For further reading, see Rasmussen and Ghahramani (2001) (Occam's Razor), Kass and Raftery (1995) (Bayes Factors), and MacKay (2003), Chapter 28. 16 / 23

Automatic Choice of Dimensionality for PCA. PCA projects a d-dimensional vector x into a k ≤ d dimensional space in a way that maximizes the variance of the projection. How do we choose k? 17 / 23

Probabilistic PCA. Formulate dimensionality reduction as a probabilistic model:
x = Σ_{j=1}^k h_j w_j + m + ɛ = Hw + m + ɛ, ɛ ~ N(0, V),
with V = vI_d and p(w) = N(0, I_k).
The maximum likelihood solution for H, given data D = {x_1, ..., x_N}, is exactly equal to the PCA solution! Let's place probability distributions over H and m, integrate them out of the likelihood, and then use the evidence p(D | k) to determine the value of k. As N → ∞, the evidence will collapse onto the true value of k.
Automatically Learning the Dimensionality of PCA (Minka, 2001). 18 / 23
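As a sketch, the generative model above sampled forward, with k then chosen by scikit-learn's PCA(n_components='mle'), which implements Minka's evidence-based selection of the dimensionality. The sizes and noise level below are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
N, d, k_true = 2000, 10, 3
H = rng.standard_normal((d, k_true))                     # mixing matrix in x = H w + m + eps
W = rng.standard_normal((N, k_true))                     # latent w ~ N(0, I_k), one row per point
m = rng.standard_normal(d)
X = W @ H.T + m + 0.1 * rng.standard_normal((N, d))      # eps ~ N(0, v I_d) with v = 0.01

pca = PCA(n_components="mle").fit(X)                     # Minka's evidence-based choice of k
print(pca.n_components_)                                 # typically recovers k_true = 3
```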

Automatically Learning the Dimensionality of PCA (figures). 19–23 / 23