Bayesian Decision and Bayesian Learning

Ying Wu
Electrical Engineering and Computer Science
Northwestern University, Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Bayes Rule

- $p(x \mid \omega_i)$: likelihood
- $P(\omega_i)$: prior
- $P(\omega_i \mid x)$: posterior

Bayes rule:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)} = \frac{p(x \mid \omega_i)\,P(\omega_i)}{\sum_i p(x \mid \omega_i)\,P(\omega_i)}$$

In other words: posterior $\propto$ likelihood $\times$ prior.
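As a quick numeric sketch of the rule (the priors and likelihood values below are made up for illustration):

```python
import numpy as np

# Hypothetical two-class example: priors and class-conditional
# likelihood values p(x | w_i) evaluated at a single observation x.
prior = np.array([0.7, 0.3])        # P(w_1), P(w_2)
likelihood = np.array([0.2, 0.6])   # p(x | w_1), p(x | w_2)

evidence = np.sum(likelihood * prior)       # p(x), by total probability
posterior = likelihood * prior / evidence   # Bayes rule

print(posterior)   # [0.4375 0.5625]; the posteriors sum to 1
```

Note that even though class 1 is a priori more probable, the likelihood of this particular observation flips the posterior toward class 2.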
Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning
Action and Risk

- Classes: $\{\omega_1, \omega_2, \ldots, \omega_c\}$
- Actions: $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$
- Loss: $\lambda(\alpha_k \mid \omega_i)$, the loss incurred by taking action $\alpha_k$ when the true class is $\omega_i$

Conditional risk:

$$R(\alpha_k \mid x) = \sum_{i=1}^{c} \lambda(\alpha_k \mid \omega_i)\,P(\omega_i \mid x)$$

A decision function $\alpha(x)$ specifies a decision rule. The overall risk,

$$R = \int_x R(\alpha(x) \mid x)\,p(x)\,dx$$

is the expected loss associated with a given decision rule.
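A minimal sketch of computing the conditional risk of each action, and the risk-minimizing choice, under a hypothetical loss matrix and posterior:

```python
import numpy as np

# lam[k, i] = loss lambda(a_k | w_i) of taking action a_k when the truth is w_i.
lam = np.array([[0.0, 2.0],    # a_1: free if w_1, costly if w_2
                [1.0, 0.0]])   # a_2: mild loss if w_1, free if w_2
posterior = np.array([0.6, 0.4])   # P(w_i | x) at some observation x

cond_risk = lam @ posterior        # R(a_k | x) = sum_i lambda(a_k | w_i) P(w_i | x)
print(cond_risk)                   # [0.8 0.6]
print(np.argmin(cond_risk))        # 1 -> take action a_2, the smaller conditional risk
```

Here $\omega_1$ is the more probable class, yet the asymmetric loss makes $\alpha_2$ the lower-risk action, which previews the Bayesian decision rule of the next slide.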
Bayesian Decision and Bayesian Risk

Bayesian decision:

$$\alpha^* = \arg\min_k R(\alpha_k \mid x)$$

This leads to the minimum overall risk. (Why? Minimizing the conditional risk pointwise at every $x$ minimizes the integrand of the overall risk everywhere, hence the integral itself.)
Bayesian risk: the minimum overall risk,

$$R^* = \int_x R(\alpha^* \mid x)\,p(x)\,dx$$

The Bayesian risk is the best one can achieve.
Example: Minimum-Error-Rate Classification

Let's look at a specific example of Bayesian decision.
In classification problems, action $\alpha_k$ corresponds to deciding $\omega_k$. Define the zero-one loss function

$$\lambda(\alpha_k \mid \omega_i) = \begin{cases} 0 & i = k \\ 1 & i \neq k \end{cases} \qquad i, k = 1, \ldots, c$$

This means no loss for correct decisions, and all errors are equally costly.
It is easy to see that the conditional risk is the error rate:

$$R(\alpha_k \mid x) = \sum_{i \neq k} P(\omega_i \mid x) = 1 - P(\omega_k \mid x)$$

The Bayesian decision rule then becomes the minimum-error-rate classification:
Decide $\omega_k$ if $P(\omega_k \mid x) > P(\omega_i \mid x)$ for all $i \neq k$.
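A small sketch checking this reduction, with illustrative posteriors:

```python
import numpy as np

# With zero-one loss, minimizing conditional risk equals maximizing the posterior.
posterior = np.array([0.2, 0.5, 0.3])   # illustrative P(w_i | x), i = 1..3
zero_one = 1.0 - np.eye(3)              # lambda(a_k | w_i) = 0 if i == k else 1

cond_risk = zero_one @ posterior        # equals 1 - posterior
assert np.allclose(cond_risk, 1.0 - posterior)
assert np.argmin(cond_risk) == np.argmax(posterior)   # same decision: w_2
```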
Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning
Classifier and Discriminant Functions

Discriminant functions $g_i(x)$, $i = 1, \ldots, c$: the classifier assigns $\omega_i$ to $x$ if $g_i(x) > g_j(x)$ for all $j \neq i$.
Examples:

$$g_i(x) = P(\omega_i \mid x), \qquad g_i(x) = p(x \mid \omega_i)\,P(\omega_i), \qquad g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

Note: the choice of discriminant function is not unique, but different choices can give equivalent classification results.
Decision region: the partition of the feature space, $x \in \mathcal{R}_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$.
Decision boundary: the surface in feature space where the largest discriminant functions tie.
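A brief sketch, under made-up parameters for a two-class univariate Gaussian problem, showing that two of these discriminant functions pick the same class:

```python
import numpy as np

# Two of the equivalent discriminant functions above; all parameters illustrative.
mu = np.array([0.0, 2.0])                  # class means
sigma, prior = 1.0, np.array([0.5, 0.5])   # shared std and priors

def pdf(x, m):
    # Univariate Gaussian density p(x | w_i) with mean m and std sigma.
    return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 0.8
g_prod = pdf(x, mu) * prior                    # g_i = p(x | w_i) P(w_i)
g_log = np.log(pdf(x, mu)) + np.log(prior)     # g_i = ln p(x | w_i) + ln P(w_i)

# The log is monotone, so both choices yield the same decision.
assert np.argmax(g_prod) == np.argmax(g_log)
```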
Multivariate Gaussian Distribution

$$p(x) = N(\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right\}$$

[Figure: contours of a 2-D Gaussian in the $(x_1, x_2)$ plane, with the principal axes marked through points $p_1$ and $p_2$.]

- The principal axes (the directions) are given by the eigenvectors of the covariance matrix $\Sigma$.
- The lengths of the axes (the uncertainty) are given by the eigenvalues of $\Sigma$.
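A minimal numeric illustration, assuming an arbitrary 2-D covariance matrix:

```python
import numpy as np

# Principal axes of an illustrative 2-D covariance matrix.
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = U diag(eigvals) U^T
# Columns of eigvecs are the principal directions; the spread along
# each axis (one standard deviation) is the square root of the eigenvalue.
print(eigvecs)             # principal directions
print(np.sqrt(eigvals))    # axis lengths
```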
Mahalanobis Distance

The Mahalanobis distance is a normalized distance:

$$\|x - \mu\|_m^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$

[Figure: two points $p_1$, $p_2$ on the same contour of a 2-D Gaussian; their Euclidean distances to $\mu$ differ, yet $\|p_1 - \mu\|_m = \|p_2 - \mu\|_m$.]
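A sketch with a hypothetical diagonal covariance, showing two points with different Euclidean distances but equal Mahalanobis distances:

```python
import numpy as np

# Mahalanobis vs. Euclidean distance for an illustrative Gaussian.
mu = np.zeros(2)
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

p1, p2 = np.array([2.0, 0.0]), np.array([0.0, 1.0])
print(np.linalg.norm(p1 - mu), np.linalg.norm(p2 - mu))   # 2.0 vs 1.0: Euclidean differs
print(mahalanobis(p1), mahalanobis(p2))                   # both 1.0: same Mahalanobis
```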
Whitening

To refresh your memory: a linear transformation of a Gaussian is still a Gaussian.

$$p(x) = N(\mu, \Sigma) \text{ and } y = A^T x \;\Longrightarrow\; p(y) = N(A^T \mu, A^T \Sigma A)$$

Question: find an $A_w$ such that the covariance of $y = A_w^T x$ becomes the identity matrix (i.e., each component has equal uncertainty).
Whitening is a transform that removes the correlation between the components:

$$A_w = U^T \Lambda^{-\frac{1}{2}}, \quad \text{where } \Sigma = U^T \Lambda U$$

Prove it: $A_w^T \Sigma A_w = \Lambda^{-\frac{1}{2}} U (U^T \Lambda U) U^T \Lambda^{-\frac{1}{2}} = \Lambda^{-\frac{1}{2}} \Lambda \Lambda^{-\frac{1}{2}} = I$.
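A quick numeric check of the whitening claim. Note that NumPy's `eigh` returns the $\Sigma = U \Lambda U^T$ convention, so the transform is written $A_w = U \Lambda^{-1/2}$ below; this is the slide's formula restated under that convention:

```python
import numpy as np

# A minimal whitening check on an illustrative covariance matrix.
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

lam, U = np.linalg.eigh(Sigma)     # Sigma = U diag(lam) U^T, U orthogonal
A_w = U @ np.diag(lam ** -0.5)     # whitening transform under this convention

# After y = A_w^T x, the covariance becomes the identity.
print(np.round(A_w.T @ Sigma @ A_w, 10))   # identity matrix
```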
Discriminant Functions for Gaussian Densities

Minimum-error-rate classifier:

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

When the class-conditional densities are Gaussian, it is easy to see:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Case I: $\Sigma_i = \sigma^2 I$

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) = -\frac{1}{2\sigma^2}\left[x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right] + \ln P(\omega_i)$$

Notice that $x^T x$ is the same for every class, so it can be dropped. Equivalently,

$$g_i(x) = \left[\frac{1}{\sigma^2}\mu_i\right]^T x + \left[-\frac{1}{2\sigma^2}\mu_i^T \mu_i + \ln P(\omega_i)\right]$$

This is a linear discriminant function: $g_i(x) = W_i^T x + W_{i0}$.
The decision boundary $g_i(x) = g_j(x)$ is therefore linear:

$$W^T(x - x_0) = 0, \quad \text{where } W = \mu_i - \mu_j \text{ and } x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\frac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$$
To See It Clearly...

Consider the specific case where $P(\omega_i) = P(\omega_j)$. The decision boundary becomes

$$(\mu_i - \mu_j)^T\left(x - \frac{\mu_i + \mu_j}{2}\right) = 0$$

What does it mean? The boundary is the perpendicular bisector of the segment joining the two Gaussian means!
What if $P(\omega_i) \neq P(\omega_j)$? Then $x_0$ shifts along $\mu_i - \mu_j$ away from the more probable class's mean, enlarging that class's decision region.
Case II: $\Sigma_i = \Sigma$

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)$$

Similarly, we can have an equivalent one:

$$g_i(x) = (\Sigma^{-1}\mu_i)^T x + \left(-\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)\right)$$

The discriminant function and decision boundary are still linear: $W^T(x - x_0) = 0$, where $W = \Sigma^{-1}(\mu_i - \mu_j)$ and

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln P(\omega_i) - \ln P(\omega_j)}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}\,(\mu_i - \mu_j)$$

Note: compared with Case I, the Euclidean distance is replaced by the Mahalanobis distance. The boundary is still linear, but the hyperplane is no longer orthogonal to $\mu_i - \mu_j$.
Case III: $\Sigma_i$ arbitrary

$$g_i(x) = x^T A_i x + W_i^T x + W_{i0}$$

where

$$A_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad W_i = \Sigma_i^{-1}\mu_i, \qquad W_{i0} = -\frac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

Note: the decision boundary is no longer linear! It is a hyperquadric.
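A sketch of the general quadratic discriminant assembled from these coefficients (the means, covariances, and priors are invented for illustration):

```python
import numpy as np

# Case III quadratic discriminant for two classes with arbitrary covariances.
def make_g(mu, Sigma, prior):
    Sigma_inv = np.linalg.inv(Sigma)
    A = -0.5 * Sigma_inv
    W = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return lambda x: x @ A @ x + W @ x + w0

g1 = make_g(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = make_g(np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5)

x = np.array([1.0, 1.5])
print(1 if g1(x) > g2(x) else 2)   # decide the class with the larger g_i(x)
```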
Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning
Learning

- Learning means training, i.e., estimating some unknowns from training samples.
- Why? It is very difficult to specify these unknowns by hand.
- Hopefully, these unknowns can be recovered from the examples given.
Maximum Likelihood (ML) Estimation

- Collected samples: $D = \{x_1, x_2, \ldots, x_n\}$
- Estimate the unknown parameters $\Theta$ such that the data likelihood is maximized.

Data likelihood (i.i.d. samples):

$$p(D \mid \Theta) = \prod_{k=1}^{n} p(x_k \mid \Theta)$$

Log likelihood:

$$L(\Theta) = \ln p(D \mid \Theta) = \sum_{k=1}^{n} \ln p(x_k \mid \Theta)$$

ML estimation:

$$\hat{\Theta} = \arg\max_{\Theta} p(D \mid \Theta) = \arg\max_{\Theta} L(\Theta)$$
Example I: Gaussian Densities (Unknown $\mu$)

$$\ln p(x_k \mid \mu) = -\frac{1}{2}\ln\left((2\pi)^d |\Sigma|\right) - \frac{1}{2}(x_k - \mu)^T \Sigma^{-1} (x_k - \mu)$$

Its partial derivative is

$$\frac{\partial \ln p(x_k \mid \mu)}{\partial \mu} = \Sigma^{-1}(x_k - \mu)$$

Setting the gradient of the log likelihood to zero, the first-order condition reads

$$\sum_{k=1}^{n} \Sigma^{-1}(x_k - \hat{\mu}) = 0$$

It is easy to see the ML estimate of $\mu$ is

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$$

This is exactly the sample mean we compute in practice.
Example II: Gaussian Densities (Unknown $\mu$ and $\Sigma$)

Let's do the univariate case first, denoting $\sigma^2$ by $\theta$.

$$\ln p(x_k \mid \mu, \theta) = -\frac{1}{2}\ln(2\pi\theta) - \frac{1}{2\theta}(x_k - \mu)^2$$

The partial derivatives and first-order conditions are

$$\frac{\partial \ln p(x_k \mid \mu, \theta)}{\partial \mu} = \frac{1}{\theta}(x_k - \mu), \qquad \frac{\partial \ln p(x_k \mid \mu, \theta)}{\partial \theta} = -\frac{1}{2\theta} + \frac{(x_k - \mu)^2}{2\theta^2}$$

$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}}(x_k - \hat{\mu}) = 0, \qquad \sum_{k=1}^{n}\left\{-\frac{1}{\hat{\theta}} + \frac{(x_k - \hat{\mu})^2}{\hat{\theta}^2}\right\} = 0$$

So we have

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$$

which generalizes to the multivariate case:

$$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^T$$
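A short sketch of the multivariate ML estimates on synthetic data (the "true" parameters are illustrative):

```python
import numpy as np

# ML estimates of a Gaussian's parameters from illustrative samples.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -1.0])
true_Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)   # D = {x_1, ..., x_n}

n = len(X)
mu_hat = X.mean(axis=0)            # sample mean
diff = X - mu_hat
Sigma_hat = diff.T @ diff / n      # note 1/n, not 1/(n-1): the ML estimate

print(mu_hat)      # close to true_mu
print(Sigma_hat)   # close to true_Sigma (slightly biased: E = (n-1)/n * Sigma)
```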
Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning
Bayesian Estimation

- Collect samples $D = \{x_1, x_2, \ldots, x_n\}$, drawn independently from a fixed but unknown distribution $p(x)$.
- Bayesian estimation uses $D$ to determine $p(x \mid D)$, i.e., to learn a p.d.f.
- The unknown $\Theta$ is a random variable (or random vector) drawn from a prior $p(\Theta)$; the density $p(x \mid \Theta)$ has a known parametric form with parameter $\Theta \sim p(\Theta)$.
- We hope the posterior $p(\Theta \mid D)$ is sharply peaked at the true value.

Differences from ML:
- In Bayesian estimation, $\Theta$ is not a single value but a random vector, and we need to recover the distribution of $\Theta$ rather than a point estimate.
- $p(x \mid D)$ also needs to be estimated.
Two Problems in Bayesian Learning

It is clear, based on the total probability rule, that

$$p(x \mid D) = \int p(x, \Theta \mid D)\,d\Theta = \int p(x \mid \Theta)\,p(\Theta \mid D)\,d\Theta$$

- $p(x \mid D)$ is a weighted average of $p(x \mid \Theta)$ over all $\Theta$.
- If $p(\Theta \mid D)$ peaks very sharply about some value $\hat{\Theta}$, then $p(x \mid D)$ can be approximated by $p(x \mid \hat{\Theta})$.

[Figure: graphical model $\Theta \to x$ illustrating how the observations $D$ are generated.]

The two problems are:
- estimating $p(\Theta \mid D)$
- estimating $p(x \mid D)$
Example: The Univariate Gaussian Case, $p(\mu \mid D)$

- Assume $\mu$ is the only unknown and it has a known Gaussian prior $p(\mu) = N(\mu_0, \sigma_0^2)$, i.e., $\mu_0$ is the best prior guess of $\mu$, and $\sigma_0$ is its uncertainty.
- Assume a Gaussian likelihood, $p(x \mid \mu) = N(\mu, \sigma^2)$, with $\sigma^2$ known.

It is clear that

$$p(\mu \mid D) \propto p(D \mid \mu)\,p(\mu) = \prod_{k=1}^{n} p(x_k \mid \mu)\,p(\mu)$$

where $p(x_k \mid \mu) = N(\mu, \sigma^2)$ and $p(\mu) = N(\mu_0, \sigma_0^2)$.
Let's prove that $p(\mu \mid D)$ is still a Gaussian density. (Why? The exponent remains quadratic in $\mu$.) Hint:

$$p(\mu \mid D) \propto \exp\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right]\right\}$$
What is Going On?

As $D$ is a collection of $n$ samples, denote $p(\mu \mid D) = N(\mu_n, \sigma_n^2)$, and let $\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$ be the sample mean.
- $\mu_n$ represents our best guess for $\mu$ after observing $n$ samples.
- $\sigma_n^2$ measures the uncertainty of this guess.

So, what is really going on here? We can obtain the following (prove it!):

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$$

- $\bar{x}_n$ carries the data fidelity; $\mu_0$ carries the prior.
- $\mu_n$ is a tradeoff (a weighted average) between the two.
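A minimal sketch of this update, with invented prior and data parameters:

```python
import numpy as np

# Closed-form posterior for the univariate Gaussian mean
# (known sigma, Gaussian prior); all numbers are illustrative.
mu0, sigma0 = 0.0, 2.0                 # prior N(mu0, sigma0^2)
sigma = 1.0                            # known likelihood std
rng = np.random.default_rng(1)
D = rng.normal(3.0, sigma, size=20)    # samples from a "true" N(3, 1)

n, xbar = len(D), D.mean()
mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)

print(mu_n, np.sqrt(var_n))   # mean pulled from mu0 toward xbar; uncertainty shrinks
```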
Example: The Univariate Gaussian Case, $p(x \mid D)$

After obtaining the posterior $p(\mu \mid D)$, we can estimate $p(x \mid D)$:

$$p(x \mid D) = \int p(x \mid \mu)\,p(\mu \mid D)\,d\mu$$

This is the convolution of two Gaussian distributions, so you can easily prove (do it!):

$$p(x \mid D) = N(\mu_n, \sigma^2 + \sigma_n^2)$$
What is Really Going On?

Let's study $\sigma_n$:
- $\sigma_n^2$ decreases monotonically: each additional observation reduces the uncertainty of the estimate.
- $\sigma_n^2 \to 0$ as $n \to \infty$, so $p(\mu \mid D)$ becomes more and more sharply peaked.

Let's study $p(x \mid D) = N(\mu_n, \sigma^2 + \sigma_n^2)$ and discuss:
- If $\sigma_0 = 0$, then what? (The prior is perfectly certain: $\mu_n = \mu_0$ no matter what the data say.)
- If $\sigma_0 \gg \sigma$, then what? (The prior is so broad that $\mu_n \approx \bar{x}_n$, essentially the ML estimate.)
Example: The Multivariate Case

We can generalize the univariate case to the multivariate Gaussian, $p(x \mid \mu) = N(\mu, \Sigma)$ with prior $p(\mu) = N(\mu_0, \Sigma_0)$. With $\bar{x}_n$ the sample mean,

$$\mu_n = \Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\bar{x}_n + \tfrac{1}{n}\Sigma\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\mu_0$$

$$\Sigma_n = \tfrac{1}{n}\Sigma_0\left(\Sigma_0 + \tfrac{1}{n}\Sigma\right)^{-1}\Sigma$$

Actually, this is the best linear unbiased estimate (BLUE). In addition,

$$p(x \mid D) = N(\mu_n, \Sigma + \Sigma_n)$$
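A sketch of the multivariate update under illustrative parameters:

```python
import numpy as np

# Multivariate posterior update for the Gaussian mean; parameters illustrative.
mu0 = np.zeros(2)
Sigma0 = np.eye(2) * 4.0                       # prior covariance
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])     # known likelihood covariance
rng = np.random.default_rng(2)
X = rng.multivariate_normal([2.0, -1.0], Sigma, size=50)

n, xbar = len(X), X.mean(axis=0)
M_inv = np.linalg.inv(Sigma0 + Sigma / n)      # shared inverse term
mu_n = Sigma0 @ M_inv @ xbar + (Sigma / n) @ M_inv @ mu0
Sigma_n = (Sigma0 @ M_inv @ Sigma) / n

print(mu_n)      # near the sample mean (the prior is comparatively weak here)
print(Sigma_n)   # posterior covariance, shrinking like 1/n
```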
Recursive Learning

Bayesian estimation can be done recursively, i.e., by updating the previous estimate with new data. Denote $D^n = \{x_1, x_2, \ldots, x_n\}$.
It is easy to see that $p(D^n \mid \Theta) = p(x_n \mid \Theta)\,p(D^{n-1} \mid \Theta)$, with $p(\Theta \mid D^0) = p(\Theta)$.
The recursion is

$$p(\Theta \mid D^n) = \frac{p(x_n \mid \Theta)\,p(\Theta \mid D^{n-1})}{\int p(x_n \mid \Theta)\,p(\Theta \mid D^{n-1})\,d\Theta}$$

So we start from $p(\Theta)$, then move on to $p(\Theta \mid x_1)$, $p(\Theta \mid x_1, x_2)$, and so on.
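A sketch showing that processing one sample at a time reproduces the batch posterior for the univariate Gaussian-mean case (all parameters are illustrative):

```python
import numpy as np

# Recursive (one-sample-at-a-time) update of the Gaussian-mean posterior,
# checked against the batch formula from the earlier slide.
mu0, var0, var = 0.0, 4.0, 1.0
rng = np.random.default_rng(3)
xs = rng.normal(3.0, np.sqrt(var), size=10)

mu_n, var_n = mu0, var0
for x in xs:
    # One Bayes step: combine the N(mu_n, var_n) "prior" with one observation.
    var_new = var_n * var / (var_n + var)
    mu_n = var_new * (mu_n / var_n + x / var)
    var_n = var_new

# Batch result for comparison.
n, xbar = len(xs), xs.mean()
mu_batch = (n * var0 * xbar + var * mu0) / (n * var0 + var)
var_batch = var0 * var / (n * var0 + var)
print(np.allclose([mu_n, var_n], [mu_batch, var_batch]))   # True
```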