Statistical and Learning Techniques in Computer Vision
Lecture 2: Maximum Likelihood and Bayesian Estimation
Jens Rittscher and Chuck Stewart

1 Motivation and Problem

In Lecture 1 we briefly saw how histograms could be used to model the probability of pixel intensity values:

Figure 1: Pixels as random variables. The image shows a sample frame of a surveillance camera installed to monitor traffic. In order to build a background model of the scene, the set of pixels was divided into three different sets: background, cars, and shadow. The three graphs show the corresponding histograms P_b(x), P_c(x), and P_s(x) of the grey values of pixel locations from the different sets. Notice that the distribution of the background pixels P_b(x) is sharply peaked. The distribution of grey values of pixels P_c(x) that correspond to vehicles is, on the other hand, almost uniform.

Histogram-based modeling is just one (simple) solution to a question of primary interest in this course: how to model probability density functions. Many learning problems may be cast as distribution modeling problems. In the next three lectures we will discuss density modeling techniques, including techniques more sophisticated than histograms. As we will see, there are several limitations to the use of histograms.
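As a concrete illustration of this kind of histogram model, the short Python sketch below builds a normalized grey-value histogram from a set of pixel samples and evaluates it as a density. It is a minimal sketch: the simulated samples and the bin count are hypothetical, not the data behind Figure 1.

import numpy as np

# Hypothetical grey values (0..255) drawn from one pixel class,
# e.g. the "background" set whose histogram P_b(x) is sharply peaked.
rng = np.random.default_rng(0)
samples = rng.normal(loc=120.0, scale=5.0, size=2000).clip(0, 255)

# Normalized histogram: with density=True the bars integrate to 1,
# so the histogram can be read as a piecewise-constant density model.
bin_edges = np.linspace(0.0, 256.0, 65)          # 64 bins of width 4
hist, _ = np.histogram(samples, bins=bin_edges, density=True)

def histogram_density(x, hist=hist, edges=bin_edges):
    """Evaluate the histogram model at a grey value x."""
    i = int(np.searchsorted(edges, x, side="right")) - 1
    i = min(max(i, 0), len(hist) - 1)
    return hist[i]

print(histogram_density(120.0))   # high density near the peak
print(histogram_density(10.0))    # essentially zero far from it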

Here is a more careful, though abstract, statement of our problem: Given a finite sample of data points X = {x^i}, all drawn from the same underlying density p(x), determine an accurate model of p(x). We will generally be interested in representations of continuous densities. For now, the superscript on the data refers to the index of the (vector-valued) sample, whereas subscripts will indicate the index of a component within a particular sample.

2 Modeling approaches

Parametric models: the probability density function has a specific functional form that depends on a vector of parameters, θ. We write this as the conditional density

    p(x | θ)    (1)

This is the probability of x given parameter vector θ. The normal distribution is an example of a parametric model.

Non-parametric: The form of the density is entirely determined by the data without any model. The histograms in Figure 1 are an example of such a non-parametric representation.

Semi-parametric: The combination of parametric and non-parametric models allows a broad class of functional forms in which the number of parameters can be increased in a systematic way to build more flexible models. The total number of parameters can be varied independently of the size of the data set. Gaussian mixture models are an example of such a class of distributions.

In general, the difficulty of generating accurate probability models increases dramatically in high dimensions.

3 Lectures 2 through 4

Today's lecture and Lectures 3 and 4 cover the above three types of modeling approaches, using them to study a number of different techniques that will be important throughout the semester.

Today: parametric models, maximum likelihood estimation and Bayesian estimation.

Lecture 3: kernel-based methods, plus an application in registration based on mutual information.

Lecture 4: semi-parametric models, mixture models and the expectation-maximization algorithm.

4 Maximum Likelihood Estimation

Recall that we are given samples X = {x^i} and we want to estimate the parameters θ. The density model is

    p(x | θ).    (2)

For a given set of parameters θ, the probability of an individual x^i is

    p(x^i | θ).    (3)

Making the reasonable assumption that the points are independent of each other (and of course follow the same distribution), the probability of obtaining X is

    p(X | θ) = ∏_{i=1}^{N} p(x^i | θ).    (4)

This is the likelihood function for a particular parameter vector θ, written as

    L(θ) = ∏_{i=1}^{N} p(x^i | θ).    (5)

Our goal is to find the value of θ that maximizes this likelihood. Because the log is a monotonic function, we can obtain the same result by maximizing the log-likelihood (or, equivalently, minimizing its negative):

    l(θ) = ln L(θ) = ∑_{i=1}^{N} ln p(x^i | θ).    (6)

This form is often simpler and therefore the one usually used. Given the resulting estimate θ̂, the desired density is

    p(x | θ̂).    (7)

5 Two Examples

In class we will work through two examples (small numerical sketches of both are given at the end of Section 6):

1. p(x | θ) is the univariate Gaussian distribution. In this case θ = (µ, σ). The result will be the maximum-likelihood estimates of the mean and variance of the Gaussian distribution. We will show that the resulting estimate is (slightly) biased.

2. p(x | θ) is the error of a point from a regression plane. The form of the regression plane is the explicit function

    y = a_0 + a_1 x̃_1 + ... + a_k x̃_k = (a_0, ..., a_k) (1, x̃ᵀ)ᵀ = aᵀ (1, x̃ᵀ)ᵀ    (8)

The data points are x = (x̃, y), where y is a measurement and x̃ is the location at which the measurement is made. If we assume the measurement errors are normally distributed about the plane and are independent with unknown variance σ², then the parameter vector to be estimated is

    θ = (a, σ²)    (9)

Here there is no mean parameter, µ. Instead, the plane plays the role of the mean vector. We will show in class that maximum likelihood estimation leads to the usual least-squares plane parameter estimate.

6 Bayesian Density Estimation Overview

We will consider this in two parts:

1. Computing the probability

    p(θ | X),    (10)

which is the conditional ("posterior") probability density function of the parameter vector given the data X. Formally, this is very different from maximum likelihood estimation, which is only interested in a particular (most likely) set of estimated parameters. Recall that the maximum likelihood estimate was determined by maximizing p(X | θ) with respect to θ. The important new information that is now being considered is the prior distribution p(θ).

2. Using the estimated probability p(θ | X) to estimate the probability p(x | X), i.e. the density of the variable x given the samples X. We do not write p(x | θ) as with maximum likelihood estimation because we will eliminate θ through integration! This allows us to consider all possible values of the parameter vector θ, in a true Bayesian style. This is used when you want to classify a new measurement based on the probability of obtaining it given previous measurements.
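To make the two maximum-likelihood examples of Section 5 concrete, here are two small numerical sketches in Python. Both use simulated data and hypothetical parameter values; they are illustrative sketches of the estimators, not definitive implementations. The first fits a univariate Gaussian by maximum likelihood and shows empirically that the variance estimate, with its 1/N factor, is slightly biased:

import numpy as np

rng = np.random.default_rng(1)
true_mu, true_sigma = 2.0, 3.0
N = 20

# Average the ML estimates over many simulated data sets to expose the bias.
mu_hat, var_hat = [], []
for _ in range(10000):
    x = rng.normal(true_mu, true_sigma, size=N)
    mu = x.mean()                        # ML estimate of the mean
    var = np.mean((x - mu) ** 2)         # ML estimate of the variance (1/N, not 1/(N-1))
    mu_hat.append(mu)
    var_hat.append(var)

print("E[mu_hat]  ~", np.mean(mu_hat))    # close to 2.0 (unbiased)
print("E[var_hat] ~", np.mean(var_hat))   # close to (N-1)/N * 9 = 8.55 (biased low)

The second sketch makes the same assumptions as Example 2 (independent Gaussian errors with unknown variance, a plane with hypothetical dimension k = 2): maximizing the likelihood of the regression model reduces to ordinary least squares, and the ML estimate of σ² is the mean squared residual.

import numpy as np

rng = np.random.default_rng(2)
N, k = 200, 2
true_a = np.array([1.0, -2.0, 0.5])      # (a_0, a_1, a_2)
true_sigma = 0.3

X_loc = rng.uniform(-1, 1, size=(N, k))              # measurement locations x~
A = np.hstack([np.ones((N, 1)), X_loc])              # design matrix with leading column of ones
y = A @ true_a + rng.normal(0, true_sigma, size=N)   # noisy measurements

# ML estimate of a: the least-squares solution.
a_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# ML estimate of sigma^2: mean squared residual (1/N factor, again slightly biased).
residuals = y - A @ a_hat
sigma2_hat = np.mean(residuals ** 2)

print("a_hat      =", a_hat)
print("sigma2_hat =", sigma2_hat)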

7 Bayesian Density Estimation Part 1

Estimating p(θ | X), the posterior distribution. Using Bayes' theorem,

    p(θ | X) = p(X | θ) p(θ) / p(X).    (11)

Notice that the denominator does not involve θ. Therefore we ignore it (for now) and later normalize the resulting function to make it a density. Dropping the denominator gives

    p(θ | X) ∝ p(X | θ) p(θ).    (12)

The first factor is the conditional density from Equation 4 that we used to obtain the maximum likelihood estimate. The second factor is the prior probability. It represents what is known in advance about the parameter vector. The result, p(θ | X), represents what is known about θ after analyzing the data set X. This is known as the posterior density. Closed forms for p(θ | X) may only be obtained in relatively simple circumstances. In general, numerical methods are needed.

The maximum a posteriori (MAP) estimate of the parameter θ is defined as

    θ̂ = arg max_θ p(θ | X).    (13)

As a simple example, to be worked in class, we will estimate the posterior distribution of the mean µ of a univariate distribution, assuming the samples are normally distributed with the unknown mean µ and a known variance σ², and the prior of µ is itself normally distributed with mean µ_0 and variance σ_0². (A small numerical sketch of this example is given after Equation 14 below.)

8 Bayesian Density Estimation Part 2

Estimating p(x | X) from p(θ | X). We have a density that ranges over all θ. The problem is to obtain a density of x. This is done using marginalization:

    p(x | X) = ∫ p(x, θ | X) dθ.    (14)

On the right is the joint density of x and θ (given X), and θ is integrated out of the expression.
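For the class example introduced at the end of Section 7 (Gaussian samples with known variance σ² and a Gaussian prior on µ), the posterior (12) has a well-known closed form, and the marginalization (14) can also be carried out exactly. The Python sketch below uses simulated data and hypothetical values for σ, µ_0 and σ_0; it is a sketch of the standard conjugate update, under those stated assumptions.

import numpy as np

rng = np.random.default_rng(3)

sigma = 1.5                  # known measurement standard deviation
mu0, sigma0 = 0.0, 10.0      # prior on mu: N(mu0, sigma0^2)
true_mu, N = 4.0, 25
x = rng.normal(true_mu, sigma, size=N)

# Conjugate update for a Gaussian mean with known variance:
#   1/sigma_N^2 = N/sigma^2 + 1/sigma0^2
#   mu_N        = sigma_N^2 * (N*mean(x)/sigma^2 + mu0/sigma0^2)
post_var = 1.0 / (N / sigma**2 + 1.0 / sigma0**2)
post_mean = post_var * (N * x.mean() / sigma**2 + mu0 / sigma0**2)

print("posterior over mu:  N(%.3f, %.4f)" % (post_mean, post_var))
print("MAP estimate (posterior peak):", post_mean)   # a Gaussian posterior peaks at its mean
print("ML estimate (sample mean):    ", x.mean())

# For this conjugate model the integral in Equation (14) is also Gaussian:
# p(x | X) = N(x; mu_N, sigma^2 + sigma_N^2), centred at the posterior mean
# with the measurement variance inflated by the posterior uncertainty.
print("predictive density: N(%.3f, %.4f)" % (post_mean, sigma**2 + post_var))

With a broad prior (large σ_0) the posterior mean stays close to the sample mean, which is the situation described in Section 9 where the MAP and maximum likelihood estimates nearly coincide.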

Continuing from Equation (14), we use the rules of conditional densities and the properties of the resulting densities to write:

    p(x | X) = ∫ p(x, θ | X) dθ = ∫ p(x | θ, X) p(θ | X) dθ = ∫ p(x | θ) p(θ | X) dθ.    (15)

In the final integral, the first factor is just the functional form of the density model and it does not depend on X, while the second factor is the posterior probability estimated in Part 1. This integral is generally quite complex. We will proceed with our example from above in class.

9 Maximum Likelihood vs. Bayesian

The maximum likelihood estimate (MLE) is a single estimated parameter vector, and it is used as the parameter vector in our parametric density modeling problem to produce the final density. Bayesian estimation produces a posterior distribution (12), and this distribution (after proper normalization) is then combined with the parametric density model through multiplication and integration to produce our final density.

In many cases we are only interested in the parameter estimate. In this case the Bayesian posterior is maximized to produce the desired estimate. This is called the maximum a posteriori (MAP) estimate. The MLE and MAP estimates are close when the data overwhelms the prior and exactly the same when the prior used is non-informative.

Note that in all these cases we always consider one particular model. Later on we will use a similar analysis in order to perform model selection.

10 Further reading

The presentation here is mostly based on Duda, Hart and Stork [DHS01, Chapter 3]. A different, but nice, summary of the material relevant to probability density estimation can be found in Bishop [Bis06, Chapter 1].

References

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[DHS01] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley and Sons, 2001.