Bayes Decision Theory

Size: px

Start display at page:

Download "Bayes Decision Theory"

Kathlyn Miles
5 years ago
Views:

1 Bayes Decision Theory Minimum-Error-Rate Classification Classifiers, Discriminant Functions and Decision Surfaces The Normal Density 0

2 Minimum-Error-Rate Classification Actions are decisions on classes If action α i is taken and the true state of nature is ω j then: decision is correct if i = j and in error if i j Seek a decision rule that minimizes the probability of error which is the error rate

3 Minimum Error Rate Classifier Derivation zero-one loss function: 0 i = j λ( α i, ω j = i, j = i j,...,c Therefore, the conditional risk is: R( α i x = j= c j= λ( α ω P( ω i j j x = j P( ω j x = P( ω i x The risk corresponding to this loss function is the average probability error Minimize the risk requires maximize P(ω i x (since R(α i x = P(ω i x For Minimum error rate Decide ω i if P (ω i x > P(ω j x j i

4 Regions of decision and zero-one loss function, Let λ λ Likelihood Ratio Classification λ λ P( ω. P( ω = θ then decide ω P( x ω λ θ λ P( x ω If λ is the zero-one loss function which means: if : > 0 λ = 0 P( ω then θ λ = = θ a P( ω 0 if λ = then θ 0 Class-conditional pdfs λ P( ω = = θ b P( ω Likelihood Ratio p(x/ω /p(x/ω. If we use a zero-one loss function decision boundaries are determined by threshold θ a. If loss function penalizes miscategorizing ω as ω more than converse we get larger threshold θ b and hence R becomes smaller 3

5 Classifiers, Discriminant Functions and Decision Surfaces Many methods of representing pattern classifiers Set of discriminant functions g i (x, i =,, c Classifier assigns feature x to class ω i if g i (x > g j (x j i Classifier is a machine that computes c discriminant functions Functional structure of a general statistical pattern Classifier with d inputs and c discriminant functions g i (x 4

6 Forms of Discriminant Functions Let g i (x = - R(α i x (max. discriminant corresponds to min. risk! For the minimum error rate, we take g i (x = P(ω i x (max. discrimination corresponds to max. posterior! g i (x P(x ω i P(ω i g i (x = ln P(x ω i + ln P(ω i 5

7 Decision Region Feature space divided into c decision regions if g i (x > g j (x j i then x is in R i -D, two-category classifier with Gaussian pdfs Decision Boundary = two hyperbolas Hence decision region R is not simply connected Ellipses mark where density is /e times that of peak distribution 6

8 The Two-Category case A classifier is a dichotomizer that has two discriminant functions g and g Let g(x g (x g (x Decide ω if g(x > 0 ; Otherwise decide ω The computation of g(x g( x = = ln P( ω x P( x ω P( x ω P( ω + ln x P( ω P( ω 7

9 The Normal Distribution A bell-shaped distribution defined by the probability density function p x µ ( σ ( x = e πσ If the random variable X follows a normal distribution, then The probability that X will fall into the interval (a,b is given by b a Expected, or mean, value of X is Variance of X is Standard deviation of X,, is p ( x dx Var E [ X ] = xp( x dx = µ ( x = E[( x ] = ( x µ p( x dx = σ σ µ σ = σ x 8

10 Relationship between Entropy and Normal Density Entropy of a distribution H ( p( x = p( xln p( x dx Measured in nats. If log is uses the unit is bits Entropy measures uncertainty in the values of points selected randomly from a distribution Normal distribution has maximum entropy over all distributions having a given mean and variance 9

11 Normal Distribution, Mean 0, Standard Deviation With 80% confidence the r.v. will lie in the two-sided interval[-.8,.8] 0

12 The Normal Density in Pattern Recognition Univariate density Analytically tractable, continuous A lot of processes are asymptotically Gaussian Central Limit Theorem: aggregate effect of a sum of a large number of small, independent random disturbances will lead to a Gaussian distribution Handwritten characters, speech sounds are ideal or prototype corrupted by random process P( x = exp π σ x µ σ Where: µ = mean (or expected value of x σ = expected squared deviation or variance, Univariate normal distribution has roughly 95% of its area in the range x-µ <σ. The peak of the distribution has value p(µ=/sqrt(πσ

13 Multivariate density where: Multivariate normal density in d dimensions is: p( x p( x = d (π abbreviated ~ / Σ as / N( µ, Σ exp ( x µ ( x µ x = (x, x,, x d t (t stands for the transpose vector form µ = (µ, µ,, µ d t mean vector Σ = d*d covariance matrix Σ and Σ - are determinant and inverse respectively t Σ

14 Mean and Covariance Matrix Formal Definitions µ = Ε[ x] = xp( x dx = Ε[( x µ ( x µ t = ( x µ ( x µ t p( x dx Mean vector has its components which are means of variables Covariance : Diagonal elements are variances of variables Cross-diagonal elements are covariances of pairs of variables Statistical independence means off-diagonal elements are zero 3

15 Multivariate Normal Density Specified by d+d(d+/ parameters: mean and independent elements of covariance matrix Locii of points of constant density are hyperellipsoids Samples drawn from a -D Gaussian lie in a cloud centered at the mean µ. Ellipses show CSE 555: lines Srihari of equal probability density of the Gaussian 4

Linear Combinations of Normally distributed variables are normally distributed Φ of and A is diagonal matrix of then the transformation A to the identity matrix Whitening Transform is matrix whose

16 Linear Combinations of Normally distributed variables are normally distributed Φ of and A is diagonal matrix of then the transformation A to the identity matrix Whitening Transform is matrix whose columns are the orthonormal Eigen vectors w = ΦA corresponding Eigen values / applied to the coordinates ensures that transformed distribution has covariance matrix equal Action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation A, takes the source distribution into distribution N(A t µ,a t ΣA. Another linear transformation- a projection P onto a Line defined by vector a leads to N(µ, σ measured along that line. While the transforms yield distributions in a different space they are shown superimposed on the original x -x space. A whitening transform A w leads to a circularly symmetric Gaussian, here shown displaced. 5

17 Mahanalobis Distance r = ( x µ t Σ ( x µ is the Mahanalobis distance from x toµ Contours of constant Density are hyperellipsoids of constant Mahanalobis Distance For a given dimensionality Scatter of samples varies directly with E / Samples drawn from a -D Gaussian lie in a cloud centered at the mean µ. Ellipses show lines of equal probability density of the Gaussian 6

Bayesian Decision Theory

Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian