COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference. Associate Instructor: (herke.vanhoof@mcgill.ca). Class web page: www.cs.mcgill.ca/~jpineau/comp551. Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

Slides: temporarily available at http://cs.mcgill.ca/~hvanho2/media/19bayesianinference.pdf. Quiz: will be online by tonight.

Bayesian probabilities. An example from regression: given a few noisy data points, multiple models are conceivable. Can we quantify uncertainty over models using probabilities? (Figure copyright C.M. Bishop, PRML.)

Bayesian probabilities. An example from regression: given a few noisy data points, multiple models are conceivable. Can we quantify uncertainty over models using probabilities? Classical / frequentist statistics says no: probability represents the frequency of a repeatable event, and there is only one true model; we cannot observe multiple realisations of the true model. (Figure copyright C.M. Bishop, PRML.)

Bayesian probabilities. The Bayesian view of probability uses probability to represent uncertainty. It is well-founded: when manipulating uncertainty, certain rules need to be respected to make rational choices, and these rules are equivalent to the rules of probability.

Goals of the lecture. At the end of the lecture, you are able to: formulate the Bayesian view on probability; give reasons for (and against) using Bayesian methods; understand the Bayesian inference and prediction steps; give some examples with analytical solutions; use posterior and predictive distributions in decision making.

Bayesian probabilities. To specify uncertainty, we need to specify a model: a prior over model parameters $p(\mathbf{w})$ and a likelihood term $p(\mathcal{D} \mid \mathbf{w})$, with dataset $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. Inference using Bayes' theorem: $p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$.

Bayesian probabilities. Predictions: $p(y \mid x, \mathcal{D}) = \int_{\mathbb{R}^N} p(y, \mathbf{w} \mid x, \mathcal{D})\, d\mathbf{w} = \int_{\mathbb{R}^N} p(\mathbf{w} \mid \mathcal{D})\, p(y \mid x, \mathbf{w})\, d\mathbf{w}$. Rather than fixing a single value for the parameters, integrate over all possible parameter values!
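
As an illustration of these two steps, here is a minimal sketch (not from the slides) that approximates both the inference and the prediction integrals numerically on a one-dimensional parameter grid; the Bernoulli model, the flat prior, and the data are assumptions made purely for illustration.

```python
# Minimal sketch: Bayesian inference and prediction on a 1-D parameter grid,
# so the integrals above become weighted sums. Model, prior and data are
# hypothetical choices for illustration, not from the lecture.
import numpy as np

w = np.linspace(1e-3, 1 - 1e-3, 1000)      # grid over the parameter w in (0, 1)
prior = np.ones_like(w) / len(w)           # flat prior p(w) on the grid

data = [1, 1, 1, 0]                        # assumed Bernoulli observations D
likelihood = np.prod([w**x * (1 - w)**(1 - x) for x in data], axis=0)

# Inference: p(w|D) = p(D|w) p(w) / p(D); the sum plays the role of p(D)
posterior = likelihood * prior
posterior /= posterior.sum()

# Prediction: p(y=1|D) = integral of p(y=1|w) p(w|D) dw, here a weighted sum
print(np.sum(w * posterior))               # averages over all parameter values
```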

Bayesian probabilities. Note: the fact that Bayes' theorem is used does not mean a method uses a Bayesian view on probabilities! Bayes' theorem is a consequence of the sum and product rules of probability, and can relate the conditional probabilities of repeatable random events (alarm vs. burglary). Many frequentist methods refer to Bayes' theorem (naive Bayes, Bayesian networks). The Bayesian view on probability: uncertainty (in parameters, in unique events) can itself be represented using probability.

Bayesian probabilities. (Comic by Randall Munroe / xkcd.com.)

Why Bayesian probabilities? Maximum likelihood estimates can have large variance: overfitting in e.g. linear regression models, or the MLE of a coin-flip probability after three sequential heads.
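
A small sketch of that coin-flip point, with the three heads assumed as the data:

```python
# After three heads in a row, the maximum-likelihood estimate of p(heads)
# is exactly 1: the fitted model is certain the coin can never land tails.
import numpy as np

flips = np.array([1, 1, 1])   # 1 = heads, 0 = tails
r_mle = flips.mean()          # MLE of p(heads) = #heads / #flips
print(r_mle)                  # 1.0 -- an extreme, high-variance estimate
```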

Why Bayesian probabilities? Maximum likelihood estimates can have large variance. We might desire or need an estimate of uncertainty: to use uncertainty in decision making (knowing the uncertainty is important for many loss functions), or to decide which data to acquire (active learning, experimental design).

Why Bayesian probabilities? Maximum likelihood estimates can have large variance. We might desire or need an estimate of uncertainty. We may have a small dataset, unreliable data, or small batches of data: Bayesian methods account for the reliability of different pieces of evidence, and make it possible to update the posterior incrementally with new data. The variance problem is especially bad with small data sets.

Why Bayesian probabilities? Maximum likelihood estimates can have large variance. We might desire or need an estimate of uncertainty. We may have a small dataset, unreliable data, or small batches of data. Bayesian methods let us use prior knowledge in a principled fashion.

Why Bayesian probabilities? Maximum likelihood estimates can have large variance. We might desire or need an estimate of uncertainty. We may have a small dataset, unreliable data, or small batches of data. Bayesian methods let us use prior knowledge in a principled fashion. In practice, using prior knowledge and uncertainty makes a difference particularly with small data sets.

Why not Bayesian probabilities? The prior induces bias. Misspecified priors: if the prior is wrong, the posterior can be far off, and the prior is often chosen for mathematical convenience rather than actual knowledge of the problem. In contrast to frequentist probability, uncertainty is subjective and can differ between different people / agents.

Algorithms for Bayesian inference. What do we need to do? Given a dataset, e.g. $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$: Inference: $p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$. Prediction: $p(y \mid x, \mathcal{D}) = \int_{\mathbb{R}^N} p(\mathbf{w} \mid \mathcal{D})\, p(y \mid x, \mathbf{w})\, d\mathbf{w}$. When can we do these steps (in closed form)?

Algorithms for Bayesian inference. Inference: $p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$. The posterior can act like a prior: $p(\mathbf{w} \mid \mathcal{D}_1, \mathcal{D}_2) = \frac{p(\mathcal{D}_2 \mid \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D}_1)}{p(\mathcal{D}_2)}$. It is desirable that the posterior and prior have the same family! Otherwise the posterior would get more complex with each step. Such priors are called conjugate priors to a likelihood function.
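
A minimal sketch of this point for the Beta-Bernoulli pair treated later in the lecture (the prior hyperparameters and the two data batches are assumed): because the posterior is again a Beta, it can serve directly as the prior for the next batch, and incremental updates agree with processing all data at once.

```python
# Conjugate (Beta-Bernoulli) updates: the posterior after D1 is a Beta again,
# so it can be used as the prior for D2; sequential and batch updates agree.
def update(a, b, flips):
    """Add observed heads to a and observed tails to b."""
    heads = sum(flips)
    return a + heads, b + len(flips) - heads

a, b = 2.0, 2.0                          # assumed Beta prior hyperparameters
D1, D2 = [1, 0, 1], [1, 1, 0, 1]         # assumed batches of coin flips

print(update(*update(a, b, D1), D2))     # sequential: D1, then D2
print(update(a, b, D1 + D2))             # batch: all data at once -> same (a, b)
```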

Algorithms for Bayesian inference. Prediction: $p(y \mid x, \mathcal{D}) = \int_{\mathbb{R}^N} p(\mathbf{w} \mid \mathcal{D})\, p(y \mid x, \mathbf{w})\, d\mathbf{w}$, where the posterior $p(\mathbf{w} \mid \mathcal{D})$ is in the same family as the prior. The argument of the integral is an unnormalised distribution over $\mathbf{w}$, and the integral calculates the normalisation constant. For a prior conjugate to the likelihood function, this constant is known.

Algorithms for Bayesian inference. Not all likelihood functions have conjugate priors. However, so-called exponential family distributions do: Normal, Exponential, Beta, Bernoulli, Categorical.

Simple example: coin toss. Flip an unfair coin; the probability of heads is an unknown value $r$. Likelihood: $\mathrm{Bern}(x \mid r) = r^x (1-r)^{1-x}$, where $x$ is one ('heads') or zero ('tails') and $r$ is the unknown parameter, between 0 and 1. (Figure: the likelihood for $x=1$ as a function of $r$; copyright C.M. Bishop, PRML.)

Simple example: coin toss. Conjugate prior: $\mathrm{Beta}(r \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, r^{a-1} (1-r)^{b-1}$. (Figures: Beta densities over $r$ for different $a, b$; copyright C.M. Bishop, PRML.)

Simple example: coin toss. Conjugate prior: $\mathrm{Beta}(r \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, r^{a-1} (1-r)^{b-1}$. The prior denotes an a priori belief over the value of $r$; $r$ is a value between 0 and 1 (the probability of heads or tails); $a, b$ are hyperparameters. (Figures, copyright C.M. Bishop, PRML: one prior under which the coin is probably more likely to give tails, and one expressing no idea about the fairness.)

Simple example: coin toss. Model: likelihood $\mathrm{Bern}(x \mid r) = r^x (1-r)^{1-x}$, conjugate prior $\mathrm{Beta}(r \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, r^{a-1} (1-r)^{b-1}$. Posterior = prior × likelihood / normalisation factor. Note the similarity in the factors: $p(r \mid x) = z^{-1}\, r^{a+x-1} (1-r)^{b-x}$, where $z$ is the normalisation factor; this is again a Beta distribution.

Simple example: coin toss. Posterior: $p(r \mid x) = z^{-1}\, r^{a+x-1} (1-r)^{b-x}$. (Figures: posteriors over $r$; copyright C.M. Bishop, PRML.) As we observe more heads, we suspect more strongly that the coin is biased. Note that $a, b$ get added to the actual outcomes: they act as pseudo-observations. The updated $a, b$ can now be used as a working prior for the next coin flip.
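
A hedged sketch of this pseudo-observation reading, with the prior counts and observed flips assumed for illustration: the posterior after the flips is Beta(a + #heads, b + #tails).

```python
# The hyperparameters a, b behave like heads/tails we pretended to see before
# the data; the posterior simply adds the real counts to them.
from scipy.stats import beta

a, b = 2, 2                              # assumed prior pseudo-counts
heads, tails = 3, 0                      # assumed observed flips
posterior = beta(a + heads, b + tails)   # Beta(a + #heads, b + #tails)

print(posterior.mean())                  # posterior mean of r, pulled toward 0.5
print(posterior.interval(0.95))          # 95% credible interval for r
```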

Simple example: coin toss. Posterior: $p(r \mid x) = z^{-1}\, r^{a+x-1} (1-r)^{b-x}$. Prediction: $p(x=1 \mid \mathcal{D}) = \int_0^1 p(x=1 \mid r)\, p(r \mid \mathcal{D})\, dr = \frac{\#\text{heads} + a}{\#\text{heads} + \#\text{tails} + a + b}$ (the first factor in the integrand is the likelihood, the second the posterior). Instead of taking one parameter value, we average over all of them; $a, b$ are again interpretable as an effective number of observations. Consider the difference if $a = b = 1$, #heads = 1, #tails = 0.

Simple example: coin toss (continued). Using the same prediction rule $p(x=1 \mid \mathcal{D}) = \frac{\#\text{heads} + a}{\#\text{heads} + \#\text{tails} + a + b}$: note that as the number of flips increases, the prior starts to matter less.
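
A small sketch of the prediction rule and the $a=b=1$ comparison above (the counts are those from the slide):

```python
# Predictive probability of heads under the Beta posterior:
# p(x=1 | D) = (#heads + a) / (#heads + #tails + a + b).
def predict_heads(heads, tails, a=1, b=1):
    return (heads + a) / (heads + tails + a + b)

print(predict_heads(1, 0))   # 0.666... with a = b = 1, one head, no tails
print(1 / (1 + 0))           # 1.0: the MLE already claims heads is certain
```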

Simple example: coin toss. Instead of taking one parameter value, we average over all of them (true for all Bayesian models). Hyperparameters are interpretable as an effective number of observations (true for many Bayesian models; depends on the parametrization). As the amount of data increases, the prior starts to matter less (true for all Bayesian models).

Example 2: mean of a 1d Gaussian. Try to learn the mean of a Gaussian distribution. Model: likelihood $p(y) = \mathcal{N}(\mu, \sigma^2)$, conjugate prior $p(\mu) = \mathcal{N}(0, \lambda^{-1})$, writing $\lambda$ for the prior precision. Assume the variances of the distributions are known: we know the mean is close to zero, but not its exact value.

Example 2: inference for Gaussian. From the shape of the distributions we again see some similarity: log likelihood $= \text{const} - \frac{1}{2\sigma^2}(y-\mu)^2$, log conjugate prior $= \text{const} - \frac{\lambda}{2}\mu^2$. Now find the log posterior.

Inference for Gaussian. Completing the square in the log posterior:
$\log p(\mu \mid y) = \text{const} - \frac{1}{2\sigma^2}(y-\mu)^2 - \frac{\lambda}{2}\mu^2$
$= \text{const} - \frac{1}{2}\left( -\frac{2y\mu}{\sigma^2} + \frac{\mu^2}{\sigma^2} + \lambda \mu^2 \right)$
$= \text{const} - \frac{1}{2}\left( -\frac{2y\mu}{\sigma^2} + (\lambda + \sigma^{-2})\,\mu^2 \right)$
$= \text{const} - \frac{\lambda + \sigma^{-2}}{2}\left( \mu - \frac{\sigma^{-2}\, y}{\lambda + \sigma^{-2}} \right)^2.$
Mean of the posterior distribution of $\mu$: $\frac{\sigma^{-2}\, y}{\lambda + \sigma^{-2}}$, between the MLE ($y$) and the prior mean ($0$). Covariance of the posterior: $(\lambda + \sigma^{-2})^{-1}$, smaller than either the covariance of the likelihood or that of the prior.
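
A small numerical sketch of this result (the values of the prior precision, noise variance, and observation are assumed for illustration):

```python
# Posterior over the Gaussian mean after one observation y:
# precision = lam + 1/sigma2, mean = (y / sigma2) / (lam + 1/sigma2).
lam, sigma2 = 1.0, 0.5      # assumed known prior precision and noise variance
y = 2.0                     # assumed single observation

post_var = 1.0 / (lam + 1.0 / sigma2)    # smaller than both 1/lam and sigma2
post_mean = post_var * y / sigma2        # between the MLE (y) and the prior mean (0)
print(post_mean, post_var)               # 1.333..., 0.333...
```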

Inference for Gaussian. (Figure; copyright C.M. Bishop, PRML.)

Prediction for Gaussian. Prediction:
$p(y \mid \mathcal{D}) = \int_{-\infty}^{\infty} p(y, \mu \mid \mathcal{D})\, d\mu = \int_{-\infty}^{\infty} p(y \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu = \int_{-\infty}^{\infty} \mathcal{N}(y \mid \mu, \sigma^2)\, \mathcal{N}\!\left(\mu \,\middle|\, \tfrac{\lambda^{-1}}{\lambda^{-1} + \sigma^2}\, y_{\text{train}},\ \tfrac{1}{\lambda + \sigma^{-2}}\right) d\mu.$
This is a convolution of Gaussians and can be solved in closed form:
$p(y \mid \mathcal{D}) = \mathcal{N}\!\left(y \,\middle|\, \tfrac{\lambda^{-1}}{\lambda^{-1} + \sigma^2}\, y_{\text{train}},\ \sigma^2 + \tfrac{1}{\lambda + \sigma^{-2}}\right)$: noise + parameter uncertainty.
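
Continuing the numerical sketch above (same assumed values), the predictive variance splits into the observation noise plus the remaining uncertainty about $\mu$:

```python
# Predictive distribution for a new y given one training observation y_train.
lam, sigma2 = 1.0, 0.5      # assumed known prior precision and noise variance
y_train = 2.0               # assumed training observation

pred_mean = (1 / lam) / (1 / lam + sigma2) * y_train   # = posterior mean of mu
pred_var = sigma2 + 1.0 / (lam + 1.0 / sigma2)         # noise + parameter uncertainty
print(pred_mean, pred_var)                             # 1.333..., 0.833...
```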

Bayesian linear regression. A more complex example: Bayesian linear regression. Model: likelihood $p(y \mid x, \mathbf{w}) = \mathcal{N}(\mathbf{w}^T x, \sigma^2)$, conjugate prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \lambda^{-1} \mathbf{I})$. The prior precision $\lambda$ and noise variance $\sigma^2$ are considered known. This is linear regression with uncertainty about the parameters.
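
A minimal sketch of this model under the assumptions above (synthetic data, an assumed prior precision and noise variance); the Gaussian posterior over $\mathbf{w}$ and the Gaussian predictive distribution follow the standard closed-form expressions for this prior/likelihood pair:

```python
# Bayesian linear regression with known prior precision lam and noise sigma2:
# posterior over w is N(m, S), predictive at x is N(m^T x, sigma2 + x^T S x).
import numpy as np

rng = np.random.default_rng(0)
lam, sigma2 = 1.0, 0.25                      # assumed known precision and noise variance

X = rng.normal(size=(20, 2))                 # synthetic design matrix (20 points, 2 features)
w_true = np.array([1.5, -0.5])               # assumed "true" weights for the synthetic data
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=20)

# Posterior: S = (lam I + X^T X / sigma2)^-1,  m = S X^T y / sigma2
S = np.linalg.inv(lam * np.eye(2) + X.T @ X / sigma2)
m = S @ X.T @ y / sigma2

# Predictive distribution at a new input x_new
x_new = np.array([1.0, 2.0])
print(m @ x_new)                             # predictive mean
print(sigma2 + x_new @ S @ x_new)            # predictive variance (noise + weight uncertainty)
```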