Probabilistic generative models

1 Linear models for classification Francesco Corona

3 Probabilistic discriminative models
Models with linear decision boundaries arise from assumptions about the data.
In a generative approach to classification, we first model the class-conditional densities $p(\mathbf{x} \mid \mathcal{C}_k)$ and the class priors $p(\mathcal{C}_k)$, and then we compute the posterior probabilities $p(\mathcal{C}_k \mid \mathbf{x})$ through Bayes' theorem.

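4 Probabilistic discriminative models (cont.)
For the two-class problem, the posterior probability of class $\mathcal{C}_1$ is
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a) \tag{1}$$
where the denominator is $p(\mathbf{x}) = \sum_k p(\mathbf{x}, \mathcal{C}_k) = \sum_k p(\mathbf{x} \mid \mathcal{C}_k)\,p(\mathcal{C}_k)$ and we defined
$$a = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)} \tag{2}$$
$\sigma(a)$ is the logistic sigmoid function (plotted in red)
$$\sigma(a) = \frac{1}{1 + \exp(-a)} \tag{3}$$
It is also called a squashing function, because it maps $\mathbb{R}$ onto a finite interval, namely $(0, 1)$.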
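As a quick numerical check of Eqs. (1) to (3), the following minimal sketch computes the posterior once by Bayes' theorem and once through the sigmoid of the log-ratio $a$. The likelihood and prior values are made-up numbers chosen for illustration; they do not come from the slides.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic sigmoid, sigma(a) = 1 / (1 + exp(-a)), as in Eq. (3)."""
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical values of p(x|C_k) at a single point x and of the priors p(C_k).
lik_1, lik_2 = 0.30, 0.05
prior_1, prior_2 = 0.4, 0.6

# Posterior by direct application of Bayes' theorem, Eq. (1).
posterior_bayes = lik_1 * prior_1 / (lik_1 * prior_1 + lik_2 * prior_2)

# Posterior through the log-ratio a of Eq. (2), squashed by the sigmoid.
a = math.log((lik_1 * prior_1) / (lik_2 * prior_2))
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)
```

Both routes give the same number, which is the point of the rewriting: the sigmoid form is an identity, not an approximation.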

5 Probabilistic discriminative models (cont.)
The logistic sigmoid satisfies the following symmetry property
$$\sigma(-a) = 1 - \sigma(a) \tag{4}$$
The inverse of the logistic sigmoid is known as the logit function
$$a = \ln\left(\frac{\sigma}{1 - \sigma}\right) \tag{5}$$
It represents the log of the ratio of probabilities for the two classes, $\ln\left(p(\mathcal{C}_1 \mid \mathbf{x}) / p(\mathcal{C}_2 \mid \mathbf{x})\right)$.
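The symmetry property (4) and the inverse relation (5) can be checked numerically. This is a small sketch; the function names `sigmoid` and `logit` and the test values are illustrative choices, not part of the slides.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic sigmoid, Eq. (3)."""
    return 1.0 / (1.0 + math.exp(-a))


def logit(s: float) -> float:
    """Inverse of the sigmoid, Eq. (5); s must lie strictly between 0 and 1."""
    return math.log(s / (1.0 - s))


values = [-3.0, -0.5, 0.0, 2.0]

# Symmetry, Eq. (4): sigma(-a) and 1 - sigma(a) should coincide.
symmetry_gaps = [abs(sigmoid(-a) - (1.0 - sigmoid(a))) for a in values]

# Round trip: logit(sigma(a)) should recover a.
roundtrip_gaps = [abs(logit(sigmoid(a)) - a) for a in values]

print(max(symmetry_gaps), max(roundtrip_gaps))
```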

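6 Probabilistic discriminative models (cont.)
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)} = \frac{1}{1 + \exp\Big(-\underbrace{\ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)}}_{a}\Big)} = \sigma\Big(\underbrace{\ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)}}_{a}\Big)$$
We have written the posterior probabilities in an equivalent form that becomes significant when $a(\mathbf{x})$ is a linear function of $\mathbf{x}$. In that case, the posterior probability is governed by a generalised linear model.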

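7 Probabilistic discriminative models (cont.)
For the case of $K > 2$ classes, we have
$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\,p(\mathcal{C}_k)}{\sum_{j=1}^{K} p(\mathbf{x} \mid \mathcal{C}_j)\,p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)} \tag{6}$$
known as the normalised exponential [1]. We have defined the quantity $a_k$ as
$$a_k = \ln\left(p(\mathbf{x} \mid \mathcal{C}_k)\,p(\mathcal{C}_k)\right) \tag{7}$$
If $a_k \gg a_j$ for all $j \neq k$, then $p(\mathcal{C}_k \mid \mathbf{x}) \simeq 1$ and $p(\mathcal{C}_j \mid \mathbf{x}) \simeq 0$.
We analyse the consequences of choosing the form of the class-conditional densities.
[1] It is a generalisation of the logistic sigmoid and it is also known as the softmax function.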
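The normalised exponential of Eq. (6) can be sketched in a few lines. Subtracting the maximum before exponentiating is a standard numerical safeguard that is not mentioned in the slides; it leaves the result unchanged because the common factor cancels in the ratio. The input values are arbitrary.

```python
import math


def softmax(a: list[float]) -> list[float]:
    """Normalised exponential, Eq. (6), shifted by max(a) for numerical stability."""
    shift = max(a)
    exps = [math.exp(v - shift) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


# Comparable activations give a spread-out posterior.
p_close = softmax([1.0, 2.0, 3.0])

# One dominant activation (a_k >> a_j) pushes its posterior towards 1.
p_dominant = softmax([1.0, 2.0, 30.0])

# With K = 2 the softmax reduces to the logistic sigmoid of a_1 - a_2.
a1, a2 = 0.7, -0.4
p_two_class = softmax([a1, a2])[0]
p_sigmoid = 1.0 / (1.0 + math.exp(-(a1 - a2)))

print(p_close, p_dominant, p_two_class, p_sigmoid)
```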

8 Outline

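10 Let us assume that the class-conditional densities $p(\mathbf{x} \mid \mathcal{C}_k)$ are Gaussian
$$p(\mathbf{x} \mid \mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right) \tag{8}$$
and that we want to explore the form of the posterior probabilities $p(\mathcal{C}_k \mid \mathbf{x})$.
The Gaussians have different means $\boldsymbol{\mu}_k$ but share the same covariance matrix $\boldsymbol{\Sigma}$.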

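11 (cont.)
With 2 classes,
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1) + p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a) \quad \text{and} \quad a = \ln \frac{p(\mathbf{x} \mid \mathcal{C}_1)\,p(\mathcal{C}_1)}{p(\mathbf{x} \mid \mathcal{C}_2)\,p(\mathcal{C}_2)},$$
so we have
$$p(\mathcal{C}_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0) \tag{9}$$
where
$$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \tag{10}$$
$$w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} \tag{11}$$
The quadratic terms in $\mathbf{x}$ from the exponents of the Gaussian densities have cancelled (due to the assumption of a common covariance matrix), leading to a linear function of $\mathbf{x}$ in the argument of the logistic sigmoid.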
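The cancellation can be verified numerically. The sketch below builds $\mathbf{w}$ and $w_0$ from Eqs. (10) and (11) and compares $\sigma(\mathbf{w}^T\mathbf{x} + w_0)$ with the posterior obtained directly from Bayes' theorem and the Gaussian density (8). The means, covariance, priors and test point are arbitrary illustrative values, and `gaussian_pdf` is a helper written for this example.

```python
import numpy as np


def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density of Eq. (8) evaluated at a single point x."""
    d = x.shape[0]
    diff = x - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))


# Hypothetical parameters: two class means, one shared covariance, two priors.
mu1 = np.array([1.0, 1.0])
mu2 = np.array([-1.0, 0.0])
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
prior1, prior2 = 0.3, 0.7

# Eq. (10): w = Sigma^{-1} (mu1 - mu2).
w = np.linalg.solve(cov, mu1 - mu2)

# Eq. (11): bias term, including the log ratio of the priors.
w0 = (-0.5 * mu1 @ np.linalg.solve(cov, mu1)
      + 0.5 * mu2 @ np.linalg.solve(cov, mu2)
      + np.log(prior1 / prior2))

x = np.array([0.2, -0.4])

# Posterior from the linear form, Eq. (9).
post_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))

# Posterior from Bayes' theorem applied to the Gaussian densities.
joint1 = gaussian_pdf(x, mu1, cov) * prior1
joint2 = gaussian_pdf(x, mu2, cov) * prior2
post_bayes = joint1 / (joint1 + joint2)

print(post_linear, post_bayes)
```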

12 (cont.)
The left-hand plot shows the class-conditional densities for two classes over 2D.
The posterior probability $p(\mathcal{C}_1 \mid \mathbf{x})$ is a logistic sigmoid of a linear function of $\mathbf{x}$.
The surface in the right-hand plot is coloured using a proportion of red given by $p(\mathcal{C}_1 \mid \mathbf{x})$ and a proportion of blue given by $p(\mathcal{C}_2 \mid \mathbf{x}) = 1 - p(\mathcal{C}_1 \mid \mathbf{x})$.

13 (cont.)
Decision boundaries are surfaces of constant posterior probability $p(\mathcal{C}_k \mid \mathbf{x})$. They are given by linear functions of $\mathbf{x}$ and are therefore linear in input space.
Prior probabilities $p(\mathcal{C}_k)$ enter only through the bias parameter $w_0$, so changes in the priors make parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.
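To make the parallel shift concrete, here is a short worked derivation for the two-class case. The primed priors $p'(\mathcal{C}_k)$ and the symbol $\Delta$ are notation introduced only for this illustration.

```latex
% Decision boundary at posterior 1/2:  w^T x + w_0 = 0.
% Replacing the priors p(C_k) by p'(C_k) leaves w unchanged and changes only w_0:
\begin{align*}
  w_0' &= w_0 + \Delta, &
  \Delta &= \ln\frac{p'(\mathcal{C}_1)}{p'(\mathcal{C}_2)}
          - \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)} .
\end{align*}
% The signed distance of the hyperplane from the origin, measured along w, is
% -w_0 / ||w||, so the boundary moves by
\begin{equation*}
  -\frac{w_0'}{\lVert \mathbf{w} \rVert} + \frac{w_0}{\lVert \mathbf{w} \rVert}
  = -\frac{\Delta}{\lVert \mathbf{w} \rVert}
\end{equation*}
% while its normal direction w stays the same: a parallel shift.
```

Increasing the prior of $\mathcal{C}_1$ makes $\Delta > 0$, which moves the boundary against the direction of $\mathbf{w}$ and enlarges the region assigned to $\mathcal{C}_1$.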

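14 (cont.)
For the $K$-class case, using
$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\,p(\mathcal{C}_k)}{\sum_{j=1}^{K} p(\mathbf{x} \mid \mathcal{C}_j)\,p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_{j=1}^{K} \exp(a_j)} \quad \text{and} \quad a_k = \ln\left(p(\mathbf{x} \mid \mathcal{C}_k)\,p(\mathcal{C}_k)\right),$$
we have
$$a_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0} \tag{12}$$
$$\mathbf{w}_k = \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k \tag{13}$$
$$w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \ln p(\mathcal{C}_k) \tag{14}$$
The $a_k(\mathbf{x})$ are again linear functions of $\mathbf{x}$, as a consequence of the cancellation of the quadratic terms due to the shared covariance.
The resulting decision boundaries (minimum misclassification rate) occur where two of the posterior probabilities (the two largest) are equal, and so they are defined by linear functions of $\mathbf{x}$. Again we have a generalised linear model.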
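A minimal NumPy sketch of Eqs. (12) to (14) for $K = 3$ follows. It checks that the softmax of the linear activations equals the posterior computed from the full Gaussian exponents. All parameter values are invented for the illustration.

```python
import numpy as np

# Hypothetical parameters: K = 3 class means (rows), shared covariance, priors.
means = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -2.0]])
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
priors = np.array([0.2, 0.5, 0.3])

cov_inv = np.linalg.inv(cov)

# Eq. (13): row k of W holds w_k^T = (Sigma^{-1} mu_k)^T (Sigma^{-1} is symmetric).
W = means @ cov_inv
# Eq. (14): w_k0 = -1/2 mu_k^T Sigma^{-1} mu_k + ln p(C_k).
w0 = -0.5 * np.einsum("kd,de,ke->k", means, cov_inv, means) + np.log(priors)

x = np.array([0.2, -0.4])

# Eq. (12) followed by the softmax of Eq. (6).
a = W @ x + w0
post_linear = np.exp(a - a.max())
post_linear /= post_linear.sum()

# Reference: keep the full quadratic form (x - mu_k)^T Sigma^{-1} (x - mu_k).
# The Gaussian normalisation constant is shared by all classes and cancels.
diffs = x - means
log_joint = -0.5 * np.einsum("kd,de,ke->k", diffs, cov_inv, diffs) + np.log(priors)
post_bayes = np.exp(log_joint - log_joint.max())
post_bayes /= post_bayes.sum()

print(post_linear, post_bayes)
```

The two activation vectors differ by the term $-\tfrac{1}{2}\mathbf{x}^T\boldsymbol{\Sigma}^{-1}\mathbf{x}$, which is the same for every class and therefore drops out of the softmax.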

15 (cont.)
If we relax the assumption of a shared covariance matrix and allow each class-conditional density $p(\mathbf{x} \mid \mathcal{C}_k)$ to have its own covariance matrix $\boldsymbol{\Sigma}_k$, then the earlier cancellations will no longer occur, and we will obtain quadratic functions of $\mathbf{x}$, giving rise to a quadratic discriminant.
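For reference, the quadratic form can be written out by taking the logarithm of a Gaussian density with class-specific covariance $\boldsymbol{\Sigma}_k$ and adding the log prior. This is a sketch of the standard expansion, not a formula taken from the slides.

```latex
\begin{align*}
  a_k(\mathbf{x})
  &= \ln\bigl(p(\mathbf{x} \mid \mathcal{C}_k)\,p(\mathcal{C}_k)\bigr) \\
  &= -\tfrac{1}{2}\,\mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}
     + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}
     - \tfrac{1}{2}\,\boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k
     - \tfrac{1}{2}\ln\lvert\boldsymbol{\Sigma}_k\rvert
     + \ln p(\mathcal{C}_k)
     - \tfrac{D}{2}\ln(2\pi)
\end{align*}
% With a shared covariance, the terms -1/2 x^T Sigma^{-1} x and -1/2 ln|Sigma|
% are identical for every class and cancel in the softmax; with class-specific
% Sigma_k they depend on k, so the quadratic term in x survives.
```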

16 (cont.)
Class-conditional densities for three classes, each having a Gaussian distribution; the red and green classes have the same covariance matrix.
The corresponding posterior probabilities and the decision boundaries: the boundary between the red and green classes, which share a covariance matrix, is linear; the boundaries between the other pairs, which have different covariance matrices, are quadratic.

18 Once we have specified a parametric functional form for the class-conditional densities $p(\mathbf{x} \mid \mathcal{C}_k)$, we can determine the values of its parameters, together with the prior class probabilities $p(\mathcal{C}_k)$, using maximum likelihood.
This requires data comprising observations of $\mathbf{x}$ and their corresponding class labels.

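19 (cont.)
Consider first the two-class case, each class having a Gaussian density with shared covariance matrix $\boldsymbol{\Sigma}$, and suppose we have data $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$ with
$$t_n = \begin{cases} 1, & \text{for } \mathcal{C}_1 \text{ with prior probability } p(\mathcal{C}_1) = \pi \\ 0, & \text{for } \mathcal{C}_2 \text{ with prior probability } p(\mathcal{C}_2) = 1 - \pi \end{cases}$$
For a data point $\mathbf{x}_n$ from class $\mathcal{C}_1$ ($\mathcal{C}_2$), we have $t_n = 1$ ($t_n = 0$) and thus
$$p(\mathbf{x}_n, \mathcal{C}_1) = p(\mathcal{C}_1)\,p(\mathbf{x}_n \mid \mathcal{C}_1) = \pi\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})$$
$$p(\mathbf{x}_n, \mathcal{C}_2) = p(\mathcal{C}_2)\,p(\mathbf{x}_n \mid \mathcal{C}_2) = (1 - \pi)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$
For $\mathbf{t} = (t_1, \dots, t_N)^T$, the likelihood function is given by
$$p(\mathbf{t}, \mathbf{X} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \bigl(\pi\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})\bigr)^{t_n} \bigl((1 - \pi)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})\bigr)^{1 - t_n} \tag{15}$$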

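20 (cont.)
As usual, we maximise the log of the likelihood function
$$\sum_{n=1}^{N} \Bigl\{ \underbrace{t_n \ln \pi + (1 - t_n)\ln(1 - \pi)}_{\pi} + \underbrace{t_n \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})}_{\boldsymbol{\mu}_1,\,\boldsymbol{\Sigma}} + \underbrace{(1 - t_n)\ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})}_{\boldsymbol{\mu}_2,\,\boldsymbol{\Sigma}} \Bigr\}$$
The first two terms depend only on $\pi$; the last two depend on $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and $\boldsymbol{\Sigma}$.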

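21 (cont.)
Consider first maximisation with respect to $\pi$, where the terms that depend on $\pi$ are
$$\sum_{n=1}^{N} \bigl( t_n \ln \pi + (1 - t_n)\ln(1 - \pi) \bigr) \tag{16}$$
Setting the derivative with respect to $\pi$ to zero and rearranging
$$\pi = \frac{1}{N}\sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} \tag{17}$$
where $N_1$ and $N_2$ denote the number of data points in classes $\mathcal{C}_1$ and $\mathcal{C}_2$.
The maximum likelihood estimate for $\pi$ is the fraction of points in $\mathcal{C}_1$.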
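The rearrangement step can be spelled out as follows. It is a short worked derivation that uses only the terms in $\pi$ and the fact that $\sum_n t_n = N_1$ and $\sum_n (1 - t_n) = N_2$.

```latex
\begin{align*}
  \frac{\partial}{\partial \pi}
  \sum_{n=1}^{N} \bigl( t_n \ln \pi + (1 - t_n)\ln(1 - \pi) \bigr)
  &= \sum_{n=1}^{N} \left( \frac{t_n}{\pi} - \frac{1 - t_n}{1 - \pi} \right)
   = \frac{N_1}{\pi} - \frac{N_2}{1 - \pi} = 0 \\
  \Longrightarrow\quad N_1 (1 - \pi) &= N_2\, \pi
  \quad\Longrightarrow\quad
  \pi = \frac{N_1}{N_1 + N_2}
\end{align*}
```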

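22 (cont.)
Now consider maximisation with respect to $\boldsymbol{\mu}_1$, where the terms that depend on $\boldsymbol{\mu}_1$ are
$$\sum_{n=1}^{N} t_n \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = -\frac{1}{2}\sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) + \text{const} \tag{18}$$
Setting the derivative with respect to $\boldsymbol{\mu}_1$ to zero and rearranging
$$\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n \mathbf{x}_n \tag{19}$$
The maximum likelihood estimate of $\boldsymbol{\mu}_1$ is the mean of the inputs $\mathbf{x}_n$ in class $\mathcal{C}_1$. Similarly, for $\boldsymbol{\mu}_2$,
$$\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1 - t_n)\,\mathbf{x}_n \tag{20}$$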

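23 (cont.)
Lastly consider maximisation with respect to $\boldsymbol{\Sigma}$, where the terms that depend on $\boldsymbol{\Sigma}$ are
$$-\frac{1}{2}\sum_{n=1}^{N} t_n \ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_1) - \frac{1}{2}\sum_{n=1}^{N} (1 - t_n)\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N} (1 - t_n)(\mathbf{x}_n - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_2)$$
$$= -\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{N}{2}\mathrm{Tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}\right) \tag{21}$$
where
$$\mathbf{S} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2 \tag{22}$$
$$\mathbf{S}_1 = \frac{1}{N_1}\sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \boldsymbol{\mu}_1)(\mathbf{x}_n - \boldsymbol{\mu}_1)^T \tag{23}$$
$$\mathbf{S}_2 = \frac{1}{N_2}\sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \boldsymbol{\mu}_2)(\mathbf{x}_n - \boldsymbol{\mu}_2)^T \tag{24}$$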
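One way to see where the maximum of (21) lies is to differentiate with respect to the precision matrix $\boldsymbol{\Sigma}^{-1}$ rather than $\boldsymbol{\Sigma}$. The sketch below uses the standard matrix-calculus identities $\partial \ln|\mathbf{A}| / \partial \mathbf{A} = \mathbf{A}^{-T}$ and $\partial\,\mathrm{Tr}(\mathbf{A}\mathbf{S}) / \partial \mathbf{A} = \mathbf{S}^T$; the slides do not show this step.

```latex
\begin{align*}
  -\frac{N}{2}\ln\lvert\boldsymbol{\Sigma}\rvert
  - \frac{N}{2}\mathrm{Tr}\bigl(\boldsymbol{\Sigma}^{-1}\mathbf{S}\bigr)
  &= \frac{N}{2}\ln\lvert\boldsymbol{\Sigma}^{-1}\rvert
   - \frac{N}{2}\mathrm{Tr}\bigl(\boldsymbol{\Sigma}^{-1}\mathbf{S}\bigr) \\
  \frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}\,[\,\cdot\,]
  &= \frac{N}{2}\boldsymbol{\Sigma} - \frac{N}{2}\mathbf{S} = \mathbf{0}
  \quad\Longrightarrow\quad \boldsymbol{\Sigma} = \mathbf{S}
\end{align*}
% Both Sigma and S are symmetric, so the transposes in the identities drop out.
```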

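24 (cont.)
$$\boldsymbol{\Sigma} = \mathbf{S} = \frac{N_1}{N}\,\frac{1}{N_1}\sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \boldsymbol{\mu}_1)(\mathbf{x}_n - \boldsymbol{\mu}_1)^T + \frac{N_2}{N}\,\frac{1}{N_2}\sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \boldsymbol{\mu}_2)(\mathbf{x}_n - \boldsymbol{\mu}_2)^T$$
It is the weighted average of the covariance matrices associated with each class separately, with weights $N_1/N$ and $N_2/N$.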
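Putting the estimates together, the following NumPy sketch fits $\pi$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and $\boldsymbol{\Sigma}$ to synthetic two-class data using Eqs. (17), (19), (20) and (22) to (24). The data-generating parameters, the sample sizes and the helper `class_covariance` are assumptions made for this example only.

```python
import numpy as np

# Synthetic data: 60 points from C1 (t = 1) and 140 from C2 (t = 0),
# drawn from Gaussians with different means and one shared covariance.
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.3], [0.3, 0.5]])
X1 = rng.multivariate_normal([1.0, 1.0], true_cov, size=60)
X2 = rng.multivariate_normal([-1.0, 0.0], true_cov, size=140)
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(60), np.zeros(140)])

N = len(t)
N1 = t.sum()
N2 = N - N1

# Eq. (17): the prior estimate is the fraction of points in C1.
pi_hat = N1 / N

# Eqs. (19) and (20): class means, written with the indicator t_n.
mu1_hat = (t[:, None] * X).sum(axis=0) / N1
mu2_hat = ((1.0 - t)[:, None] * X).sum(axis=0) / N2


def class_covariance(Xc, mu):
    """Per-class covariance S_k of Eqs. (23)/(24), normalised by the class size."""
    diff = Xc - mu
    return diff.T @ diff / len(Xc)


S1 = class_covariance(X[t == 1], mu1_hat)
S2 = class_covariance(X[t == 0], mu2_hat)

# Eq. (22): shared covariance as the weighted average of S1 and S2.
sigma_hat = (N1 / N) * S1 + (N2 / N) * S2

print(pi_hat, mu1_hat, mu2_hat, sigma_hat, sep="\n")
```

Note that the per-class covariances are normalised by $N_k$ rather than $N_k - 1$, which is what maximum likelihood gives; this corresponds to `np.cov(..., bias=True)`.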
