CSC412 Probabilistic Learning & Reasoning
Lecture 12: Bayesian Parameter Estimation
February 27, 2006
Sam Roweis

Probability vs. Statistics 1

Probability: inferring probabilistic quantities from partial data given fixed models (e.g. marginals, conditionals, log likelihood).
Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
There are many approaches to statistics. We have focused on (regularized) maximum likelihood.

Bayesian Approach 2

The Bayesian programme (after Rev. Thomas Bayes) treats all unknown quantities as random variables and represents uncertainty over those quantities using probability distributions.
Thus, unknown parameters θ are treated as random variables just like latent (hidden) variables or missing data. This means we have probability distributions over the parameters: we can (and should) put priors p(θ) over them, and can compute things like posteriors p(θ | D).
Crucially, we want to integrate/sum out all unobserved quantities (even parameters), just as we did with things like cluster assignment variables or continuous latent factors.

Plates 3

Since Bayesian methods treat parameters as random variables, we would like to include them in the graphical model. One way to do this is to repeat all the iid observations explicitly and show the parameter only once. A better way is to use plates, in which repeated quantities that are iid are put in a box.

[Figure: (a) the parameter node connected explicitly to each observation X1, X2, X3, ..., XN; (b) the same model drawn with a single node Xn inside a plate.]
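A minimal way to see the plate structure in code is ancestral sampling: draw the shared parameter once from its prior, then draw each of the N iid observations given it. The sketch below is a hypothetical numpy example; the Gaussian prior and likelihood and all numbers are assumptions for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared parameter drawn once (the node outside the plate):
# assume a Gaussian prior over the mean, theta ~ N(mu0, tau^2).
mu0, tau = 0.0, 2.0
theta = rng.normal(mu0, tau)

# The plate: N iid observations x_n | theta ~ N(theta, sigma^2).
# One vectorized call mirrors "repeat the boxed structure N times".
N, sigma = 10, 1.0
x = rng.normal(theta, sigma, size=N)

print("sampled parameter:", theta)
print("iid observations:", x)
```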

Plates are Macros for Repeated Structure 4

Plates are like macros that allow you to draw a very complicated graphical model with a simpler notation. The rules of plates are simple: repeat every structure in a box a number of times given by the integer in the corner of the box (e.g. N), updating the plate index variable (e.g. n) as you go. Duplicate every arrow going into the plate and every arrow leaving the plate by connecting the arrows to each copy of the structure.

Nested/Intersecting Plates 5

Plates can be nested, in which case their arrows get duplicated also, according to the rule: draw an arrow from every copy of the source node to every copy of the destination node. Plates can also cross (intersect), in which case the nodes at the intersection have multiple indices and get duplicated a number of times equal to the product of the duplication numbers on all the plates containing them.

[Figure: nested and intersecting plates with nodes y_n, w_nj, z_nj and plate indices n and j.]

Posterior Over Parameters 6

If θ is a random variable, we can view the likelihood as a conditional probability and use Bayes rule:

p(θ | D) = p(D | θ) p(θ) / p(D)

This crucial equation can be written in words: posterior ∝ likelihood × prior.
Computing the posterior requires conditioning on the data and having a prior over parameters. In contrast, frequentists consider various estimators of θ and hope to show that they have desirable properties, e.g. ML, unbiased, minimum variance, etc.

Model Averaging 7

The posterior p(θ | D) is used in all future Bayesian computations. For example, to do prediction of a new value x_new given some iid data X, we compute the conditional posterior:

p(x_new | X) = ∫ p(x_new, θ | X) dθ = ∫ p(x_new | θ, X) p(θ | X) dθ = ∫ p(x_new | θ) p(θ | X) dθ

This means the Bayesian prediction is based on averaging the predictions from lots of models, weighted by the posterior probability of the model's parameters.

[Figure: observed data X and a new point Xnew, both children of the parameter θ.]
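To make the Bayes-rule slide concrete, here is a small sketch that applies posterior ∝ likelihood × prior on a discretized grid and normalizes. The Bernoulli (coin flip) likelihood, the uniform prior, and the counts are illustrative assumptions, not from the slides.

```python
import numpy as np

# Grid over the coin bias theta in (0, 1).
theta = np.linspace(0.01, 0.99, 99)

# Assumed prior: uniform over the grid.
prior = np.ones_like(theta) / len(theta)

# Made-up data: 7 heads out of 10 flips.
heads, flips = 7, 10
likelihood = theta**heads * (1 - theta)**(flips - heads)

# Bayes rule: posterior is proportional to likelihood * prior.
posterior = likelihood * prior
posterior /= posterior.sum()   # the normalizer plays the role of p(D)

print("posterior mean of theta:", (theta * posterior).sum())
print("posterior mode of theta:", theta[np.argmax(posterior)])
```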

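The model averaging slide can be sketched the same way: the predictive probability of a new observation is its likelihood under each θ on the grid, weighted by the posterior. Again a hypothetical coin-flip example with assumed numbers, compared against the maximum likelihood plug-in prediction.

```python
import numpy as np

# Rebuild the grid posterior for 7 heads in 10 flips under a uniform prior.
theta = np.linspace(0.01, 0.99, 99)
posterior = theta**7 * (1 - theta)**3
posterior /= posterior.sum()

# Posterior predictive: p(x_new = heads | X) = sum_theta p(heads | theta) p(theta | X).
# This averages the prediction of every model on the grid, weighted by its posterior.
p_heads_bayes = (theta * posterior).sum()

# The plug-in (maximum likelihood) prediction just uses theta_ML = 7/10.
p_heads_ml = 7 / 10

print("Bayesian predictive p(heads):", p_heads_bayes)
print("ML plug-in p(heads):        ", p_heads_ml)
```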
Multiple Model Averaging 8

Imagine that we wanted to compute the probability of some new data (e.g. the density of a new point) taking into account the predictions of all models. We can compute:

p(x_new | X) = ∫∫ p(x_new, θ, m | X) dθ dm
             = ∫∫ p(x_new | θ, m, X) p(θ, m | X) dθ dm
             = ∫∫ p(x_new | θ, m, X) p(θ | m, X) p(m | X) dθ dm

This requires two posteriors, p(m | X) (see later) and p(θ | m, X).
Remember: maximum likelihood alone cannot be used to do either model selection or model averaging since it is always subject to overfitting. Bayesian methods in principle can never overfit, since we integrate over all unknown quantities.

ML vs. Maximum-A-Posteriori (MAP) 9

If we forced a Bayesian to pick a single value for the parameters rather than use the entire posterior p(θ | D), what would they do?

Bayes point (mean of posterior): θ_Bayes = ∫ θ p(θ | D) dθ
MAP (mode of posterior): θ_MAP = argmax_θ p(θ | D) = argmax_θ [ log p(D | θ) + log p(θ) ]

The maximum a-posteriori (MAP) estimate looks exactly like maximum likelihood except for an extra term which depends only on the parameters. This is often called penalized maximum likelihood, and it is what we have been studying in this course so far.

Integrate or Optimize? 10

Normally, Bayesian statistics needs to perform an integral in order to do predictions. Frequentist statistics uses a plug-in estimator such as ML. We can be pseudo-Bayesian by using single estimators such as the Bayes point or MAP.
Notice that both the Bayesian approach and the ML (frequentist) approach need to calculate the likelihood function p(X | θ), which is what the graphical model specifies. So all the work we have done so far is applicable to both the Bayesian and ML frameworks.

Example: Scalar Gaussian Model 11

Consider a univariate Gaussian with fixed, known variance σ². We want to put a prior p(μ) on the mean, and then compute its posterior p(μ | X) using the Gaussian likelihood p(X | μ). What should the prior be? Try another Gaussian:

p(μ) = (1/√(2πτ²)) exp{ -(μ - μ0)² / (2τ²) } = N(μ0, τ²)

Now the joint probability can be written as:

p(X, μ) = p(X | μ) p(μ) = (2πσ²)^(-N/2) exp{ -(1/(2σ²)) Σ_{n=1}^{N} (x_n - μ)² } · (1/√(2πτ²)) exp{ -(μ - μ0)² / (2τ²) }

We need to marginalize this joint with respect to μ to obtain the posterior p(μ | X). This normalization can be done using the conditional Gaussian formulas or by explicitly completing the square.
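As a sketch of the penalized maximum likelihood view of MAP, the snippet below finds the MAP estimate of a coin bias by numerically maximizing log p(D | θ) + log p(θ). The Beta(2, 2) prior and the counts are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Made-up data: 7 heads and 3 tails.
heads, tails = 7, 3

def neg_log_posterior(theta):
    # Bernoulli log likelihood plus log of the assumed Beta(2, 2) prior.
    log_lik = heads * np.log(theta) + tails * np.log(1 - theta)
    log_prior = beta.logpdf(theta, 2, 2)
    return -(log_lik + log_prior)

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")

theta_ml = heads / (heads + tails)      # plain maximum likelihood
print("ML estimate: ", theta_ml)
print("MAP estimate:", res.x)           # pulled slightly toward the prior mean 0.5
```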

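Before the closed form on the next slide, the normalization in the scalar Gaussian example can be checked numerically: evaluate the (unnormalized) joint p(X, μ) on a grid of μ values and normalize. The values of σ, τ, μ0 and the data below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed model constants and made-up data.
sigma, tau, mu0 = 1.0, 2.0, 0.0
x = rng.normal(1.5, sigma, size=10)     # pretend the true mean is 1.5

# Unnormalized log joint p(X, mu) = p(X | mu) p(mu) on a grid of mu
# (constant factors are dropped since they cancel when we normalize).
mu = np.linspace(-5, 5, 2001)
log_lik = -0.5 * ((x[:, None] - mu[None, :]) ** 2).sum(axis=0) / sigma**2
log_prior = -0.5 * (mu - mu0) ** 2 / tau**2
log_joint = log_lik + log_prior

# Normalize over mu to get the posterior p(mu | X) on the grid.
post = np.exp(log_joint - log_joint.max())
post /= post.sum() * (mu[1] - mu[0])

print("posterior mean of mu (grid):", (mu * post).sum() * (mu[1] - mu[0]))
print("sample mean (ML):           ", x.mean())
```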
Scalar Gaussian: Posterior over the Mean 12

Amazingly, the posterior is another Gaussian:

p(μ | X) = p(X | μ) p(μ) / ∫ p(X, μ) dμ = (1/√(2πs²)) exp{ -(μ - m)² / (2s²) } = N(m, s²)

m = [ (N/σ²) / (N/σ² + 1/τ²) ] μ_ML + [ (1/τ²) / (N/σ² + 1/τ²) ] μ0
s² = ( N/σ² + 1/τ² )^(-1)

where μ_ML is the sample mean.

Conjugate Priors 13

In the example we just worked out, the posterior had the same form as the prior (both were Gaussian). When this happens, the prior is called the conjugate prior for the parameters with respect to the likelihood function.
Conjugate priors are very nice to work with because the posterior and prior have the same parameter types, and the effect of the data is just to update the parameters from the prior to the posterior. In these settings, the prior can often be interpreted as some pseudo-data which we observed before we saw the real data.
Remember Laplace smoothing? That is just a pseudo-count of unity, which in turn is just a conjugate prior for the multinomial...

Example: Multinomial 14

Bayesian methods can also be used to estimate the density of discrete quantities (e.g. spam/no-spam, shoe colour). If we use a multinomial distribution over K settings as the likelihood model, the conjugate prior is called the Dirichlet distribution, defined as:

p(θ) = C(α) θ1^(α1 - 1) θ2^(α2 - 1) ... θK^(αK - 1),   C(α) = Γ(Σ_k α_k) / Π_k Γ(α_k)

where Γ(·) is the gamma function and the exponent α_k - 1 is a convention. This is a funny density, because it is a density over the simplex, i.e. over vectors whose components are non-negative and sum to one.
In the binary case, the multinomial becomes a binomial, p(x | θ) = θ^x (1 - θ)^(1 - x), and the Dirichlet becomes a beta distribution, p(θ) = C(α) θ^(α1 - 1) (1 - θ)^(α2 - 1).

Multinomial Posterior 15

The posterior is also of the form of a Dirichlet:

p(θ | X) ∝ p(X | θ) p(θ) = Π_k θ_k^(Σ_n [x_n = k]) · Π_k θ_k^(α_k - 1) = Π_k θ_k^(α_k - 1 + Σ_n [x_n = k])

which has parameters α'_k = α_k + Σ_n [x_n = k]. We see that to update the prior into a posterior, we simply add the observed counts to the priors. So we can think of the priors as pseudo-counts.
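The closed form above is a two-line computation; the sketch below implements m and s² directly and compares the posterior mean to the ML estimate. The values of σ, τ, μ0 and the data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

sigma, tau, mu0 = 1.0, 2.0, 0.0        # assumed known noise std, prior std, prior mean
x = rng.normal(1.5, sigma, size=20)    # made-up data
N, mu_ml = len(x), x.mean()

# Posterior over the mean: a precision-weighted combination of the
# ML estimate and the prior mean.
precision = N / sigma**2 + 1 / tau**2
m = (N / sigma**2) / precision * mu_ml + (1 / tau**2) / precision * mu0
s2 = 1 / precision

print("ML mean:       ", mu_ml)
print("posterior mean:", m)
print("posterior var: ", s2)
```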

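The multinomial posterior update is just adding counts to the Dirichlet parameters. A minimal sketch, assuming a symmetric Dirichlet(1) prior (i.e. Laplace smoothing) over K = 3 categories and made-up observations:

```python
import numpy as np

K = 3
alpha_prior = np.ones(K)        # symmetric Dirichlet(1): one pseudo-count per class

# Made-up categorical observations, coded 0..K-1.
x = np.array([0, 2, 2, 1, 2, 0, 2])
counts = np.bincount(x, minlength=K)

# Conjugate update: posterior parameters are prior pseudo-counts plus observed counts.
alpha_post = alpha_prior + counts

# Posterior mean of theta, i.e. the smoothed class probabilities.
theta_mean = alpha_post / alpha_post.sum()

print("counts:          ", counts)
print("posterior alphas:", alpha_post)
print("posterior mean:  ", theta_mean)
```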
Hierarchical Bayes and Structure Learning 16

What about the parameters of the parameter priors? In a full Bayesian formulation, they also have priors, called hyperpriors, and we treat them in the same way. In theory we should do this upwards forever, but in practice we usually stop after only one or two levels.

[Figure: hierarchical model with hyperparameters α and β above τ² and σ², which govern the observed nodes Xn and Yn inside a plate.]

What about the model structure? In a full Bayesian formulation we also have a prior on that, and attempt to get a posterior. Sometimes this can be done (e.g. fully observed tree learning was just maximum likelihood over structure).

Model Selection 17

In principle we could do model structure learning in a Bayesian way also. Consider a fixed class of models, indexed by m = 1...M (e.g. Gaussian mixtures with m components). Since m is unknown, the Bayesian way is to treat it as a random variable and to compute its posterior:

p(m | X) = p(X | m) p(m) / p(X)

Notice that we require a prior p(m) on models as well as the marginal likelihood:

p(X | m) = ∫ p(X, θ | m) dθ = ∫ p(X | θ, m) p(θ | m) dθ

We could try to compute the model with the highest posterior, in which case we do not have to compute p(X). Or else we could use all of the models, weighted by their posteriors, to do predictions at test time. This was called model averaging.

Practical Bayesian and Other Methods 18

Often the integrals required by correct Bayesian reasoning are computationally intractable, and so we resort to approximations, such as sampling, variational methods, large sample approximations (BIC, AIC, MDL), tree-structured bounds, etc.
There are also many non-Bayesian methods for model selection and capacity control, such as kernel machines, locally weighted modeling, and a very popular and powerful class of algorithms known (oddly) as non-parametric or semiparametric approaches to solving estimation problems. For these adventures, see the courses by Prof. Neal and Prof. Boutilier.
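As a sketch of the model selection slide: for a Bernoulli likelihood with a Beta(a, b) prior, the marginal likelihood of a particular binary sequence with h heads and t tails has the closed form p(X | m) = B(a + h, b + t) / B(a, b), so comparing two models needs only the log-beta function. The two candidate priors, the equal prior over models, and the data below are illustrative assumptions.

```python
import numpy as np
from scipy.special import betaln

# Made-up data: a particular sequence with 9 heads and 1 tail.
h, t = 9, 1

def log_marginal_likelihood(a, b):
    # log p(X | m) for a Bernoulli likelihood with a Beta(a, b) prior:
    # log B(a + h, b + t) - log B(a, b)
    return betaln(a + h, b + t) - betaln(a, b)

# Two hypothetical models: m1 expects a roughly fair coin, m2 expects a biased one.
log_px_m1 = log_marginal_likelihood(20, 20)
log_px_m2 = log_marginal_likelihood(5, 1)

# Equal prior p(m) over the two models; posterior p(m | X) by Bayes rule.
logs = np.array([log_px_m1, log_px_m2])
post = np.exp(logs - logs.max())
post /= post.sum()

print("p(m | X):", post)   # the biased-coin model should win on this data
```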