Parametric Techniques Lecture 3

Parametric Techniques Lecture 3. Jason Corso, SUNY at Buffalo, 22 January 2009.

Introduction
In Lecture 2, we learned how to form optimal decision boundaries when the full probabilistic structure of the problem is known. However, this is rarely the case in practice. Instead, we have some knowledge of the problem and some example data, and we must estimate the probabilities. The focus of this lecture is a pair of techniques for estimating the parameters of the likelihood models (given a particular form of likelihood function, such as a Gaussian).
Parametric Models: For a particular class $\omega_i$, we consider a set of parameters $\theta_i$ to fully define the likelihood model. For the Gaussian, $\theta_i = (\mu_i, \Sigma_i)$.
Supervised Learning: we are working in a supervised situation where we have a set of training data:
$\mathcal{D} = \{(x, \omega)_1, (x, \omega)_2, \dots, (x, \omega)_N\}$  (1)

Overview of the Methods
Intuitive Problem: Given a set of training data, $\mathcal{D}$, containing labels for c classes, train the likelihood models $p(x \mid \omega_i, \theta_i)$ by estimating the parameters $\theta_i$ for $i = 1, \dots, c$.
Maximum Likelihood Parameter Estimation: Views the parameters as quantities that are fixed but unknown. The best estimate of their value is the one that maximizes the probability of obtaining the samples in $\mathcal{D}$.
Bayesian Parameter Estimation: Views the parameters as random variables having some known prior distribution. The samples convert this prior into a posterior and revise our estimate of the distribution over the parameters. We shall typically see that the posterior is increasingly peaked for larger $\mathcal{D}$.

Maximum Likelihood Estimation: Maximum Likelihood Intuition
[Figure: training samples x, the likelihood $p(\mathcal{D} \mid \theta)$, and the log-likelihood $l(\theta)$ plotted as functions of $\theta$, both peaking at $\hat{\theta}$.] The underlying model is assumed to be a Gaussian of particular variance but unknown mean.

Maximum Likelihood Estimation: Preliminaries
Separate our training data according to class; i.e., we have c data sets $\mathcal{D}_1, \dots, \mathcal{D}_c$. Assume that samples in $\mathcal{D}_i$ give no information for $\theta_j$ for all $i \neq j$.
Assume the samples in $\mathcal{D}_j$ have been drawn independently according to the (unknown but) fixed density $p(x \mid \omega_j)$. We say these samples are i.i.d. (independent and identically distributed).
Assume $p(x \mid \omega_j)$ has some fixed parametric form and is fully described by $\theta_j$; hence we write $p(x \mid \omega_j, \theta_j)$.
We thus have c separate problems of the form:
Definition: Use a set $\mathcal{D} = \{x_1, \dots, x_n\}$ of training samples drawn independently from the density $p(x \mid \theta)$ to estimate the unknown parameter vector $\theta$.

Maximum Likelihood Estimation: (Log-)Likelihood
Because we assume i.i.d., we have
$p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$.  (2)
The log-likelihood is typically easier to work with, both analytically and numerically:
$l_{\mathcal{D}}(\theta) \equiv l(\theta) \doteq \ln p(\mathcal{D} \mid \theta)$  (3)
$= \sum_{k=1}^{n} \ln p(x_k \mid \theta)$  (4)

Maximum Likelihood Estimation: Maximum (Log-)Likelihood
The maximum likelihood estimate of $\theta$ is the value $\hat{\theta}$ that maximizes $p(\mathcal{D} \mid \theta)$, or equivalently maximizes $l_{\mathcal{D}}(\theta)$:
$\hat{\theta} = \arg\max_{\theta} l_{\mathcal{D}}(\theta)$  (5)
[Figure: the likelihood $p(\mathcal{D} \mid \theta)$ and the log-likelihood $l(\theta)$ as functions of $\theta$; both peak at the same $\hat{\theta}$.]
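To make eq. (5) concrete, here is a minimal sketch (my own, not from the lecture; the function name, the synthetic data, and the grid-search strategy are all assumptions) that evaluates the Gaussian log-likelihood of eq. (4) over a grid of candidate means and picks the maximizer. For this model the answer should agree with the sample mean.

```python
import numpy as np

def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood l(mu) = sum_k ln p(x_k | mu) for a 1-D Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
sigma = 2.0                                  # assumed known standard deviation
data = rng.normal(loc=5.0, scale=sigma, size=50)

# Grid search over candidate means (a stand-in for analytic maximization).
grid = np.linspace(0.0, 10.0, 2001)
ll = np.array([gaussian_log_likelihood(data, mu, sigma) for mu in grid])
mu_hat = grid[np.argmax(ll)]

print(mu_hat, data.mean())   # the two estimates should nearly coincide
```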

Maximum Likelihood Estimation: Necessary Conditions for MLE
For p parameters, $\theta \doteq [\theta_1\ \theta_2\ \dots\ \theta_p]^T$. Let $\nabla_\theta$ be the gradient operator, $\nabla_\theta \doteq [\partial/\partial\theta_1\ \dots\ \partial/\partial\theta_p]^T$.
The set of necessary conditions for the maximum likelihood estimate of $\theta$ is obtained from the following system of p equations:
$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta) = 0$  (6)
A solution $\hat{\theta}$ to (6) can be a true global maximum, a local maximum or minimum, or an inflection point of $l(\theta)$.
Keep in mind that $\hat{\theta}$ is only an estimate. Only in the limit of an infinitely large number of training samples can we expect it to be the true parameters of the underlying density.
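As a quick illustration of the stationarity condition (6), the sketch below (my own; the step size and parameter values are arbitrary assumptions) checks numerically that the derivative of the Gaussian log-likelihood with respect to the mean vanishes at the sample mean.

```python
import numpy as np

def log_likelihood(data, mu, sigma):
    # l(mu) = sum_k ln N(x_k; mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(1)
sigma = 1.5
data = rng.normal(3.0, sigma, size=200)

mu_hat = data.mean()          # candidate MLE
h = 1e-5                      # finite-difference step

# Central-difference approximation of dl/dmu at mu_hat; should be ~0 by eq. (6).
grad_at_mle = (log_likelihood(data, mu_hat + h, sigma)
               - log_likelihood(data, mu_hat - h, sigma)) / (2 * h)
print(grad_at_mle)
```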

Maximum Likelihood Estimation: Gaussian Case with Known $\Sigma$ and Unknown $\mu$
For a single sample point $x_k$:
$\ln p(x_k \mid \mu) = -\frac{1}{2}\ln\left[(2\pi)^d |\Sigma|\right] - \frac{1}{2}(x_k - \mu)^T \Sigma^{-1} (x_k - \mu)$  (7)
$\nabla_\mu \ln p(x_k \mid \mu) = \Sigma^{-1}(x_k - \mu)$  (8)
We see that the ML estimate must satisfy
$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$  (9)
And we get the sample mean!
$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$  (10)

Maximum Likelihood Estimation: Univariate Gaussian Case with Unknown $\mu$ and $\sigma^2$: The Log-Likelihood
Let $\theta = (\mu, \sigma^2)$. The log-likelihood of $x_k$ is
$\ln p(x_k \mid \theta) = -\frac{1}{2}\ln\left[2\pi\sigma^2\right] - \frac{1}{2\sigma^2}(x_k - \mu)^2$  (11)
$\nabla_\theta \ln p(x_k \mid \theta) = \begin{bmatrix} \frac{1}{\sigma^2}(x_k - \mu) \\ -\frac{1}{2\sigma^2} + \frac{(x_k - \mu)^2}{2\sigma^4} \end{bmatrix}$  (12)

Maximum Likelihood Estimation: Univariate Gaussian Case with Unknown $\mu$ and $\sigma^2$: Necessary Conditions
Setting the gradient to zero yields the following conditions:
$\sum_{k=1}^{n} \frac{1}{\hat{\sigma}^2}(x_k - \hat{\mu}) = 0$  (13)
$-\sum_{k=1}^{n} \frac{1}{\hat{\sigma}^2} + \sum_{k=1}^{n} \frac{(x_k - \hat{\mu})^2}{\hat{\sigma}^4} = 0$  (14)

Maximum Likelihood Estimation: Univariate Gaussian Case with Unknown $\mu$ and $\sigma^2$: ML Estimates
After some manipulation we have the following:
$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$  (15)
$\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n} (x_k - \hat{\mu})^2$  (16)
These are encouraging results: even in the case of unknown $\mu$ and $\sigma^2$, the ML estimate of $\mu$ corresponds to the sample mean.
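A minimal numerical check of eqs. (15)-(16) (my own sketch, not part of the lecture; the data are synthetic): the ML estimates are simply the sample mean and the biased sample variance, which NumPy computes with ddof=0.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=-1.0, scale=3.0, size=1000)

mu_hat = data.mean()                 # eq. (15): sample mean
sigma2_hat = np.var(data, ddof=0)    # eq. (16): biased (1/n) sample variance

# Equivalent explicit form of eq. (16):
sigma2_explicit = np.mean((data - mu_hat) ** 2)
print(mu_hat, sigma2_hat, sigma2_explicit)
```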

Maximum Likelihood Estimation: Bias
The maximum likelihood estimate of the variance $\sigma^2$ is biased: the expected value over datasets of size n of the sample variance is not equal to the true variance,
$E\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2\right] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$.  (17)
In other words, the ML estimate of the variance systematically underestimates the variance of the distribution. As $n \to \infty$ the problem is reduced, but bias remains a property of the ML estimator. An unbiased estimator of the variance is
$\hat{\sigma}^2_{\text{unbiased}} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$  (18)
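The bias in eq. (17) is easy to see empirically. The sketch below (my own, under the stated Gaussian assumptions; the sample size and number of trials are arbitrary) repeatedly draws small datasets and averages the 1/n sample variance; the average should land near (n-1)/n times the true variance rather than the true variance itself.

```python
import numpy as np

rng = np.random.default_rng(3)
true_sigma2 = 4.0
n = 5                        # small n makes the bias visible
trials = 200_000

samples = rng.normal(0.0, np.sqrt(true_sigma2), size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # 1/n estimator, eq. (16)
unbiased_var = samples.var(axis=1, ddof=1)  # 1/(n-1) estimator, eq. (18)

print(ml_var.mean())             # ~ (n-1)/n * sigma^2 = 3.2
print(unbiased_var.mean())       # ~ sigma^2 = 4.0
print((n - 1) / n * true_sigma2)
```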

Bayesian Parameter Estimation: Intuition
[Figure 3.2: Bayesian learning of the mean of normal distributions; the posterior $p(\mu \mid x_1, x_2, \dots, x_n)$ becomes sharper as the number of training samples grows.]

Bayesian Parameter Estimation: General Assumptions
The form of the density $p(x \mid \theta)$ is assumed to be known (e.g., it is a Gaussian).
The values of the parameter vector $\theta$ are not exactly known.
Our initial knowledge about the parameters is summarized in a prior distribution $p(\theta)$.
The rest of our knowledge about $\theta$ is contained in a set $\mathcal{D}$ of n i.i.d. samples $x_1, \dots, x_n$ drawn according to the fixed density $p(x)$.
Goal: Our ultimate goal is to estimate $p(x \mid \mathcal{D})$, which is as close as we can come to estimating the unknown $p(x)$.

Bayesian Parameter Estimation: Linking Likelihood and the Parameter Distribution
How do we relate the prior distribution on the parameters to the samples? Missing data! The samples will convert our prior $p(\theta)$ to a posterior $p(\theta \mid \mathcal{D})$; we integrate the joint density over $\theta$:
$p(x \mid \mathcal{D}) = \int p(x, \theta \mid \mathcal{D})\, d\theta$  (19)
$= \int p(x \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta$  (20)
And, because the distribution of x is known given the parameters $\theta$, we simplify to
$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$  (21)

Bayesian Parameter Estimation: Linking Likelihood and the Parameter Distribution
$p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$
We can see the link between the likelihood $p(x \mid \theta)$ and the posterior for the unknown parameters $p(\theta \mid \mathcal{D})$.
If the posterior $p(\theta \mid \mathcal{D})$ peaks very sharply about some point $\hat{\theta}$, then we obtain
$p(x \mid \mathcal{D}) \approx p(x \mid \hat{\theta})$.  (22)
And, we will see that during Bayesian parameter estimation, the distribution over the parameters becomes increasingly peaked as the number of samples increases.
What if the integral cannot readily be computed analytically?

Bayesian Parameter Estimation: The Posterior Density on the Parameters
The primary task in Bayesian parameter estimation is the computation of the posterior density $p(\theta \mid \mathcal{D})$. By Bayes' formula,
$p(\theta \mid \mathcal{D}) = \frac{1}{Z}\, p(\mathcal{D} \mid \theta)\, p(\theta)$  (23)
where Z is a normalizing constant:
$Z = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$  (24)
And, by the independence assumption on $\mathcal{D}$:
$p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$  (25)
Let's see some examples before we return to this general formulation.
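When $\theta$ is low-dimensional, eqs. (23)-(25) can be evaluated numerically on a grid. Below is a minimal sketch (my own, not from the slides; the prior parameters, grid range, and seed are assumptions) for the Gaussian-mean example that follows: the unnormalized posterior $p(\mathcal{D} \mid \mu)\,p(\mu)$ is computed on a grid of $\mu$ values and normalized by a grid approximation of Z.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 1.0                               # known likelihood std. dev.
mu0, sigma0 = 0.0, 2.0                    # Gaussian prior on the mean
data = rng.normal(1.5, sigma, size=20)

def log_gauss(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (x - mean)**2 / (2 * std**2)

grid = np.linspace(-5.0, 5.0, 4001)       # candidate values of mu
dx = grid[1] - grid[0]

# log p(D|mu) + log p(mu): eq. (25) plus the numerator of eq. (23)
log_post = np.array([log_gauss(data, mu, sigma).sum() for mu in grid])
log_post += log_gauss(grid, mu0, sigma0)

post = np.exp(log_post - log_post.max())  # subtract the max for numerical stability
post /= post.sum() * dx                   # normalize: grid approximation of Z, eq. (24)

print(grid[np.argmax(post)])              # posterior mode for the mean
```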

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$
Assume $p(x \mid \mu) \sim N(\mu, \sigma^2)$ with known $\sigma^2$. Whatever prior knowledge we have about $\mu$ is expressed in $p(\mu)$, which is known.
Indeed, we assume it takes the form of a Gaussian,
$p(\mu) \sim N(\mu_0, \sigma_0^2)$.  (26)
$\mu_0$ represents our best guess of the value of the mean and $\sigma_0^2$ represents our uncertainty about this guess.
Note: the choice of a Gaussian prior is not so crucial. Rather, the assumption that we know the prior is crucial.

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Training Samples
We assume that we are given samples $\mathcal{D} = \{x_1, \dots, x_n\}$ from $p(x, \mu)$. Take some time to think through this point: unlike in MLE, we cannot assume that we have a single value of the parameter in the underlying distribution.

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Bayes' Rule
$p(\mu \mid \mathcal{D}) = \frac{1}{Z}\, p(\mathcal{D} \mid \mu)\, p(\mu)$  (27)
$= \frac{1}{Z} \prod_{k} p(x_k \mid \mu)\, p(\mu)$  (28)
See how the training samples modulate our prior knowledge of the parameters in the posterior?

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Expanding
$p(\mu \mid \mathcal{D}) = \frac{1}{Z} \prod_{k} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right] \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^2\right]$  (29)
After some manipulation, we can see that $p(\mu \mid \mathcal{D})$ is an exponential function of a quadratic in $\mu$, which is another way of saying it is a normal density:
$p(\mu \mid \mathcal{D}) = \frac{1}{Z} \exp\left[-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right]$  (30)

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Names of These Convenient Distributions
And this will be true regardless of the number of training samples; in other words, $p(\mu \mid \mathcal{D})$ remains a normal density as the number of samples increases. Hence, $p(\mu \mid \mathcal{D})$ is said to be a reproducing density and $p(\mu)$ is said to be a conjugate prior.

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Rewriting
We can write $p(\mu \mid \mathcal{D}) \sim N(\mu_n, \sigma_n^2)$. Then we have
$p(\mu \mid \mathcal{D}) = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right]$  (31)
The new coefficients satisfy
$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$  (32)
$\frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$  (33)
where $\hat{\mu}_n$ is the sample mean over the n samples.

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Rewriting
Solving explicitly for $\mu_n$ and $\sigma_n^2$,
$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$  (34)
$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$  (35)
shows explicitly how the prior information is combined with the training samples to estimate the parameters of the posterior distribution. After n samples, $\mu_n$ is our best guess for the mean of the posterior and $\sigma_n^2$ is our uncertainty about it.
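A small sketch of eqs. (34)-(35) (my own; the function name, prior values, and seed are assumptions): given the conjugate Gaussian prior and n observations, the posterior parameters are available in closed form, and the posterior mean visibly interpolates between the prior mean and the sample mean.

```python
import numpy as np

def gaussian_mean_posterior(data, sigma, mu0, sigma0):
    """Posterior N(mu_n, sigma_n^2) over the mean, per eqs. (34)-(35)."""
    n = len(data)
    sample_mean = np.mean(data)
    mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * sample_mean \
         + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
    sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    return mu_n, sigma_n2

rng = np.random.default_rng(5)
sigma = 1.0                         # known likelihood std. dev.
mu0, sigma0 = 0.0, 2.0              # prior mean and std. dev.
data = rng.normal(1.5, sigma, size=10)

mu_n, sigma_n2 = gaussian_mean_posterior(data, sigma, mu0, sigma0)
print(mu_n, sigma_n2, data.mean())  # mu_n lies between mu0 and the sample mean
```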

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Uncertainty
What can we say about this uncertainty as n increases?
$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$
Each observation monotonically decreases our uncertainty about the distribution:
$\lim_{n \to \infty} \sigma_n^2 = 0$  (36)
In other terms, as n increases, $p(\mu \mid \mathcal{D})$ becomes more and more sharply peaked, approaching a Dirac delta function.

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$
What can we say about the parameter $\mu_n$ as n increases?
$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$
It is a convex combination of the sample mean $\hat{\mu}_n$ (from the observed data) and the prior mean $\mu_0$; thus, it always lies somewhere between $\hat{\mu}_n$ and $\mu_0$. And it approaches the sample mean as n approaches $\infty$:
$\lim_{n \to \infty} \mu_n = \hat{\mu}_n \doteq \frac{1}{n}\sum_{k=1}^{n} x_k$  (37)

Bayesian Parameter Estimation: Univariate Gaussian Case with Known $\sigma^2$: Putting It All Together to Obtain $p(x \mid \mathcal{D})$
Our goal has been to obtain an estimate of how likely a novel sample x is given the entire training set $\mathcal{D}$, i.e., $p(x \mid \mathcal{D})$:
$p(x \mid \mathcal{D}) = \int p(x \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu$  (38)
$= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^2\right] d\mu$  (39)
$= \frac{1}{2\pi\sigma\sigma_n} \exp\left[-\frac{1}{2}\,\frac{(x - \mu_n)^2}{\sigma^2 + \sigma_n^2}\right] f(\sigma, \sigma_n)$  (40)
Essentially, $p(x \mid \mathcal{D}) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$.
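Under the result above, the Bayesian predictive density is itself Gaussian, with the parameter uncertainty $\sigma_n^2$ added to the observation noise $\sigma^2$. The sketch below (my own; the function name and parameter values are assumptions) evaluates it for a new point.

```python
import numpy as np

def predictive_density(x_new, data, sigma, mu0, sigma0):
    """p(x|D) = N(x; mu_n, sigma^2 + sigma_n^2) for the known-variance Gaussian model."""
    n = len(data)
    sample_mean = np.mean(data)
    mu_n = (n * sigma0**2 * sample_mean + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    var = sigma**2 + sigma_n2          # predictive variance: noise + parameter uncertainty
    return np.exp(-0.5 * (x_new - mu_n)**2 / var) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(6)
data = rng.normal(1.5, 1.0, size=10)
print(predictive_density(1.0, data, sigma=1.0, mu0=0.0, sigma0=2.0))
```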

Bayesian Parameter Estimation: Some Comparisons
Maximum Likelihood: a point estimator, $p(x \mid \mathcal{D}) = p(x \mid \hat{\theta})$, with parameter estimate $\hat{\theta} = \arg\max_{\theta} \ln p(\mathcal{D} \mid \theta)$.
Bayesian: a distribution estimator, $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$, with distribution estimate $p(\theta \mid \mathcal{D}) = \frac{1}{Z}\, p(\mathcal{D} \mid \theta)\, p(\theta)$.

Bayesian Parameter Estimation: Some Comparisons
So, is the Bayesian approach like Maximum Likelihood with a prior? NO! That would be the maximum posterior (MAP) point estimator:
Maximum Posterior (MAP): a point estimator, $p(x \mid \mathcal{D}) = p(x \mid \hat{\theta})$, with parameter estimate $\hat{\theta} = \arg\max_{\theta} \ln\left[p(\mathcal{D} \mid \theta)\, p(\theta)\right]$.
Bayesian: a distribution estimator, $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$, with distribution estimate $p(\theta \mid \mathcal{D}) = \frac{1}{Z}\, p(\mathcal{D} \mid \theta)\, p(\theta)$.
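To make the distinction concrete, here is a small sketch (my own, under the same known-variance Gaussian assumptions used earlier) comparing the MAP point estimate with the full Bayesian predictive: both use the prior, but only the Bayesian predictive keeps the extra variance $\sigma_n^2$ that reflects parameter uncertainty.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
data = rng.normal(1.5, sigma, size=10)
n, xbar = len(data), data.mean()

# MAP estimate of the mean: maximizes p(D|mu) p(mu); for this conjugate Gaussian
# model it coincides with the posterior mean mu_n of eq. (34).
mu_map = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)

# The Bayesian predictive keeps parameter uncertainty; the MAP plug-in does not.
sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
print("MAP plug-in predictive: N(%.3f, %.3f)" % (mu_map, sigma**2))
print("Bayesian predictive:    N(%.3f, %.3f)" % (mu_map, sigma**2 + sigma_n2))
```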

Bayesian Parameter Estimation: Some Comparisons: Comments on the Two Methods
For reasonable priors, MLE and Bayesian parameter estimation (BPE) are equivalent in the asymptotic limit of infinite training data.
Computation: MLE methods are generally preferred because they are comparatively simpler (differential calculus versus multidimensional integration).
Interpretability: MLE methods are often more readily interpreted because they give a single point answer, whereas BPE methods give a distribution over answers, which can be more complicated.
Confidence in priors: The Bayesian methods bring more information to the table. If the underlying distribution is of a different parametric form than originally assumed, Bayesian methods will do better.
Bias-variance: Bayesian methods make the bias-variance tradeoff more explicit by directly incorporating the uncertainty in the estimates.

Bayesian Parameter Estimation: Some Comparisons: Comments on the Two Methods
Take-home message: There are strong theoretical and methodological arguments supporting Bayesian estimation, though in practice maximum-likelihood estimation is simpler and, when used for designing classifiers, can lead to classifiers that are nearly as accurate.

Recursive Bayesian Learning: Recursive Bayesian Estimation
Another reason to prefer Bayesian estimation is that it provides a natural way to incorporate additional training data as it becomes available. Let a training set with n samples be denoted $\mathcal{D}^n$. Then, due to our independence assumption
$p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$,  (41)
we have
$p(\mathcal{D}^n \mid \theta) = p(x_n \mid \theta)\, p(\mathcal{D}^{n-1} \mid \theta)$.  (42)

Recursive Bayesian Learning: Recursive Bayesian Estimation
And, with Bayes' formula, we see that the posterior satisfies the recursion
$p(\theta \mid \mathcal{D}^n) = \frac{1}{Z}\, p(x_n \mid \theta)\, p(\theta \mid \mathcal{D}^{n-1})$.  (43)
This is an instance of on-line learning. In principle, this derivation requires that we retain the entire training set $\mathcal{D}^{n-1}$ to calculate $p(\theta \mid \mathcal{D}^n)$. But for some distributions, we can simply retain the sufficient statistics, which contain all the information needed.
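A minimal sketch of the recursion in eq. (43) for the known-variance Gaussian mean (my own, not from the slides; the prior and seed are assumptions): because the model is conjugate, each update needs only the current posterior parameters, not the past samples, and the final result matches the batch formulas (34)-(35).

```python
import numpy as np

def recursive_update(mu_prev, var_prev, x_new, sigma):
    """One step of eq. (43) for a Gaussian mean with known sigma and a Gaussian posterior."""
    var_new = 1.0 / (1.0 / var_prev + 1.0 / sigma**2)
    mu_new = var_new * (mu_prev / var_prev + x_new / sigma**2)
    return mu_new, var_new

rng = np.random.default_rng(8)
sigma = 1.0
mu, var = 0.0, 2.0**2            # start from the prior N(mu0, sigma0^2)
data = rng.normal(1.5, sigma, size=10)

for x in data:                   # on-line: process one sample at a time
    mu, var = recursive_update(mu, var, x, sigma)

# Batch result from eqs. (34)-(35) for comparison.
n, xbar, sigma0 = len(data), data.mean(), 2.0
mu_batch = (n * sigma0**2 * xbar + sigma**2 * 0.0) / (n * sigma0**2 + sigma**2)
var_batch = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
print(mu, var)
print(mu_batch, var_batch)       # should match the recursive result
```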

[Figure 3.2: Bayesian learning of the mean of normal distributions in one and two dimensions. The posterior distribution estimates $p(\mu \mid x_1, x_2, \dots, x_n)$ are labeled by the number of training samples used in the estimation. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]

Recursive Bayesian Learning: Example of Recursive Bayes
Suppose we believe our samples come from a uniform distribution:
$p(x \mid \theta) \sim U(0, \theta) = \begin{cases} 1/\theta & 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}$  (44)
Initially, we know only that our parameter $\theta$ is bounded by 10, i.e., $0 \le \theta \le 10$. Before any data arrive, we have
$p(\theta \mid \mathcal{D}^0) = p(\theta) = U(0, 10)$.  (45)
We then get a training data set $\mathcal{D} = \{4, 7, 2, 8\}$.

Recursive Bayesian Learning: Example of Recursive Bayes
When the first data point arrives, $x_1 = 4$, we get an improved estimate of $\theta$:
$p(\theta \mid \mathcal{D}^1) \propto p(x \mid \theta)\, p(\theta \mid \mathcal{D}^0) = \begin{cases} 1/\theta & 4 \le \theta \le 10 \\ 0 & \text{otherwise} \end{cases}$  (46)
When the next data point arrives, $x_2 = 7$, we have
$p(\theta \mid \mathcal{D}^2) \propto p(x \mid \theta)\, p(\theta \mid \mathcal{D}^1) = \begin{cases} 1/\theta^2 & 7 \le \theta \le 10 \\ 0 & \text{otherwise} \end{cases}$  (47)
And so on...

Recursive Bayesian Learning: Example of Recursive Bayes
Notice that each successive data sample introduces a factor of $1/\theta$ into $p(x \mid \theta)$, and the posterior is nonzero only for $\theta$ values at or above the largest sample seen so far. Our distribution is therefore
$p(\theta \mid \mathcal{D}^n) \propto 1/\theta^n \quad \text{for } \max_x[\mathcal{D}^n] \le \theta \le 10$.
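This example is easy to reproduce numerically. The sketch below (my own; the grid resolution is an arbitrary choice) accumulates the posterior on a grid of $\theta$ values one data point at a time, as in eq. (43), and compares the result with the closed form $1/\theta^n$ on $[\max(\mathcal{D}), 10]$.

```python
import numpy as np

theta = np.linspace(0.01, 10.0, 10_000)   # grid over the parameter
d_theta = theta[1] - theta[0]
posterior = np.ones_like(theta)           # flat prior U(0, 10), up to a constant
data = [4, 7, 2, 8]

for x in data:
    likelihood = np.where(theta >= x, 1.0 / theta, 0.0)   # p(x|theta) for U(0, theta)
    posterior *= likelihood                                # recursive update, eq. (43)
    posterior /= posterior.sum() * d_theta                 # renormalize on the grid

# Closed form: proportional to 1/theta^n on [max(data), 10].
closed = np.where(theta >= max(data), 1.0 / theta**len(data), 0.0)
closed /= closed.sum() * d_theta

print(np.max(np.abs(posterior - closed)))   # should be tiny (grid error only)
print(theta[np.argmax(posterior)])          # posterior peaks at max(data) = 8
```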

Recursive Bayesian Learning: Example of Recursive Bayes
The maximum likelihood solution is $\hat{\theta} = 8$, implying $p(x \mid \mathcal{D}) \sim U(0, 8)$. But the Bayesian solution shows a different character: it starts out flat; as more points are added, it becomes increasingly peaked at the value of the highest data point; and the Bayesian estimate has a tail for points above 8, reflecting our prior distribution.