Linear Classification: Probabilistic Generative Models


Elwin Ball
5 years ago

Sargur N.
University at Buffalo, State University of New York, USA

Linear Classification using Probabilistic Generative Models

Topics
1. Overview (generative vs. discriminative)
2. Bayes classifier using logistic sigmoid and softmax
3. Continuous inputs: Gaussian-distributed class-conditionals, parameter estimation
4. Discrete features
5. Exponential family

Overview of Methods for Classification

1. Generative models (two-step)
   1. Infer the class-conditional densities $p(x|C_k)$ and priors $p(C_k)$
   2. Use Bayes' theorem to determine the posterior probabilities
      $$p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)}$$
2. Discriminative models (one-step)
   Directly infer the posterior probabilities $p(C_k|x)$

In both cases, use decision theory to assign each new $x$ to a class.

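Generative Model

Model the class-conditionals $p(x|C_k)$ and priors $p(C_k)$, then compute the posteriors $p(C_k|x)$ from Bayes' theorem.

Two-class case: the posterior for class $C_1$ is
$$p(C_1|x) = \frac{p(x|C_1)\,p(C_1)}{p(x|C_1)\,p(C_1) + p(x|C_2)\,p(C_2)} = \frac{1}{1+\exp(-a)} = \sigma(a)$$
where
$$a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}$$
is the log-likelihood ratio (LLR) combined with the Bayes (prior) odds. The denominator uses
$$p(x) = \sum_i p(x, C_i) = \sum_i p(x|C_i)\,p(C_i)$$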
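The equivalence between the Bayes-theorem form and the sigmoid form can be checked numerically. The sketch below is illustrative only: the likelihood and prior values are made-up numbers, and the function names are hypothetical rather than taken from the slides.

```python
import math

def sigmoid(a):
    """Logistic sigmoid, 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def posterior_bayes(lik1, prior1, lik2, prior2):
    """p(C1|x) computed directly from Bayes' theorem."""
    joint1 = lik1 * prior1
    joint2 = lik2 * prior2
    return joint1 / (joint1 + joint2)

def posterior_sigmoid(lik1, prior1, lik2, prior2):
    """p(C1|x) computed as sigma(a), with a the log ratio of the joints."""
    a = math.log((lik1 * prior1) / (lik2 * prior2))
    return sigmoid(a)

# Made-up values of p(x|C1), p(C1), p(x|C2), p(C2) at one point x (assumption).
lik1, prior1 = 0.30, 0.4
lik2, prior2 = 0.10, 0.6

p_bayes = posterior_bayes(lik1, prior1, lik2, prior2)
p_sig = posterior_sigmoid(lik1, prior1, lik2, prior2)
print(p_bayes, p_sig)  # the two values agree up to rounding
```

Both routes give the same posterior because dividing numerator and denominator by $p(x|C_1)p(C_1)$ turns the Bayes ratio into $1/(1+\exp(-a))$.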

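Logistic Sigmoid Function

[Figure: plot of $\sigma(a)$ against $a$ for $a$ from about $-5$ to $5$. The dotted line is the scaled probit function, the cdf of a zero-mean, unit-variance Gaussian.]

Sigmoid: an "S"-shaped or squashing function that maps real $a \in (-\infty, +\infty)$ to the finite interval $(0, 1)$:
$$\sigma(a) = \frac{1}{1+\exp(-a)}$$
Property: $\sigma(-a) = 1 - \sigma(a)$
Inverse: $a = \ln\left(\frac{\sigma}{1-\sigma}\right)$

If $\sigma(a) = p(C_1|x)$, then the inverse represents $\ln[p(C_1|x)/p(C_2|x)]$, the log ratio of probabilities, called the logit or log odds.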
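The symmetry property and the inverse can be verified with a few lines of code. This is a minimal sketch; the branch on the sign of `a` is an added numerical-stability choice, not something stated on the slide.

```python
import math

def sigmoid(a):
    """Logistic sigmoid; branches on the sign of a to avoid overflow in exp."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def logit(p):
    """Inverse of the sigmoid: the log odds ln(p / (1 - p)), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for a in (-3.0, -0.5, 0.0, 1.2, 4.0):
    s = sigmoid(a)
    # Symmetry: sigma(-a) = 1 - sigma(a); inverse: logit(sigma(a)) = a.
    print(a, s, sigmoid(-a) + s, logit(s))
```

Each printed row shows `sigmoid(-a) + sigmoid(a)` equal to 1 and `logit(sigmoid(a))` recovering `a`, up to rounding.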

Generalizations and Special Cases

- More than 2 classes
- Gaussian distribution of x
- Discrete features
- Exponential family


Softmax: Generalization of the Logistic Sigmoid

For K = 2 we used the logistic sigmoid $p(C_1|x) = \sigma(a)$, where $a$ is the log ratio of probabilities
$$a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}$$

For K > 2 we can use its generalization
$$p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{\sum_j p(x|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where the quantities $a_k$ are defined by $a_k = \ln[p(x|C_k)\,p(C_k)]$.

This is known as the softmax function, since it is a smoothed max function: if $a_k \gg a_j$ for all $j \neq k$, then $p(C_k|x) \approx 1$ and $p(C_j|x) \approx 0$ for the rest. It is a general technique for finding a smoothed maximum of several $a_k$.

If K = 2, softmax reduces to the sigmoid:
$$\begin{aligned}
p(C_1|x) &= \frac{\exp(a_1)}{\exp(a_1) + \exp(a_2)} = \frac{1}{1 + \exp(a_2 - a_1)} \\
&= \frac{1}{1 + \exp\left(\ln[p(x|C_2)\,p(C_2)] - \ln[p(x|C_1)\,p(C_1)]\right)} \\
&= \frac{1}{1 + \dfrac{p(x|C_2)\,p(C_2)}{p(x|C_1)\,p(C_1)}} = \frac{1}{1 + \exp(-a)}
\end{aligned}$$
where $a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)} = a_1 - a_2$.
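A short sketch of the softmax makes the two claims above concrete: the K = 2 case matches the sigmoid of $a_1 - a_2$, and one dominant $a_k$ pushes its posterior close to 1. Subtracting the maximum before exponentiating is an added numerical-stability step (it leaves the ratio unchanged); the input values are made up.

```python
import math

def softmax(a):
    """Normalized exponential of a list of activations a_k."""
    m = max(a)  # shift by the max; the ratio exp(a_k) / sum_j exp(a_j) is unchanged
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# K = 2: softmax reduces to the sigmoid of a1 - a2.
a1, a2 = 0.7, -1.1
two_class = softmax([a1, a2])
print(two_class[0], sigmoid(a1 - a2))

# One dominant activation: the posterior for that class is close to 1.
dominant = softmax([50.0, 0.0, -3.0])
print(dominant)
```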

Specific Forms of Class-Conditionals

We will next see that linear classifiers arise in both the continuous and the discrete case as a consequence of choosing specific forms for the class-conditional densities $p(x|C_k)$. We look first at continuous input variables x and then discuss discrete inputs.

Continuous Inputs: Gaussians

Assume Gaussian class-conditional densities with the same covariance matrix $\Sigma$:
$$p(x|C_k) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1} (x-\mu_k)\right\}$$

Consider first the two-class case. Substituting into
$$p(C_1|x) = \sigma\left(\ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}\right)$$
and rearranging, we get
$$p(C_1|x) = \sigma(w^T x + w_0)$$
where
$$w = \Sigma^{-1}(\mu_1 - \mu_2)$$
$$w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1} \mu_2 + \ln\frac{p(C_1)}{p(C_2)}$$

The quadratic terms in x from the exponents of the Gaussians cancel because of the common covariance matrix, so the argument of the logistic sigmoid is a linear function of x.
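The closed form for $w$ and $w_0$ can be checked against a direct application of Bayes' theorem. The sketch below works in D = 2 with plain lists so that it needs only the standard library; the means, covariance, prior, and test point are made-up values, and the helper names are hypothetical.

```python
import math

def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def gauss_pdf(x, mu, S):
    """Two-dimensional Gaussian density N(x | mu, S)."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    quad = dot(d, matvec(inv2(S), d))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def linear_params(mu1, mu2, S, prior1):
    """w and w0 for the shared-covariance two-class case."""
    Sinv = inv2(S)
    w = matvec(Sinv, [mu1[0] - mu2[0], mu1[1] - mu2[1]])
    w0 = (-0.5 * dot(mu1, matvec(Sinv, mu1))
          + 0.5 * dot(mu2, matvec(Sinv, mu2))
          + math.log(prior1 / (1.0 - prior1)))
    return w, w0

# Made-up parameters (assumption, for illustration only).
mu1, mu2 = [1.0, 1.0], [-1.0, 0.0]
S = [[1.0, 0.3], [0.3, 2.0]]
prior1 = 0.3
x = [0.5, -0.2]

w, w0 = linear_params(mu1, mu2, S, prior1)
p_linear = sigmoid(dot(w, x) + w0)

joint1 = gauss_pdf(x, mu1, S) * prior1
joint2 = gauss_pdf(x, mu2, S) * (1.0 - prior1)
p_bayes = joint1 / (joint1 + joint2)
print(p_linear, p_bayes)  # agree up to rounding
```

The agreement holds for any x, since the term $-\frac{1}{2}x^T\Sigma^{-1}x$ and the Gaussian normalizer appear in both classes and cancel in the log ratio.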


Two Gaussian Classes

Two-dimensional input space $x = (x_1, x_2)$.

[Figure, first panel: class-conditional densities $p(x|C_k)$. The values are positive but need not sum to 1.]
[Figure, second panel: posterior $p(C_1|x)$, a logistic sigmoid of a linear function of x, with a linear decision boundary. Red ink is proportional to $p(C_1|x)$ and blue ink to $p(C_2|x) = 1 - p(C_1|x)$. Away from the boundary the posterior approaches 1 or 0.]

Continuous Case with K > 2

$$p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{\sum_j p(x|C_j)\,p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

With Gaussian class-conditionals that share a covariance matrix,
$$a_k(x) = w_k^T x + w_{k0}$$
where
$$w_k = \Sigma^{-1}\mu_k$$
$$w_{k0} = -\frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)$$

The quadratic terms cancel, which leads to linearity. If we do not assume a shared covariance matrix, we get a quadratic discriminant.
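The cancellation can be spelled out in one step. The following is a sketch of the expansion under the shared-covariance assumption, using the symmetry of $\Sigma^{-1}$; no symbols beyond those on the slide are introduced.

```latex
\begin{aligned}
\ln\left[p(x|C_k)\,p(C_k)\right]
 &= -\frac{1}{2}(x-\mu_k)^T \Sigma^{-1} (x-\mu_k)
    - \ln\left[(2\pi)^{D/2}|\Sigma|^{1/2}\right] + \ln p(C_k) \\
 &= \underbrace{-\frac{1}{2}x^T \Sigma^{-1} x
    - \ln\left[(2\pi)^{D/2}|\Sigma|^{1/2}\right]}_{\text{same for every } k}
    + \underbrace{\mu_k^T \Sigma^{-1} x}_{w_k^T x}
    \underbrace{{} - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k)}_{w_{k0}}
\end{aligned}
```

The terms marked as the same for every k multiply numerator and denominator of the softmax by a common factor, so they drop out and only the linear part $w_k^T x + w_{k0}$ remains. With class-specific covariances $\Sigma_k$ the quadratic term depends on k and no longer cancels.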


Three-Class Case with Gaussian Models: Both Linear and Quadratic Decision Boundaries

[Figure, first panel: class-conditional densities. $C_1$ and $C_2$ have the same covariance matrix.]
[Figure, second panel: posterior probabilities. RGB values correspond to the posterior probabilities of the three classes.]

The boundary between $C_1$ and $C_2$ is linear; the others are quadratic.

Maximum Likelihood Solutions

Once we have specified a parametric functional form for the class-conditional densities $p(x|C_k)$, we can determine the parameters, together with the prior probabilities $p(C_k)$, using maximum likelihood. This requires a data set of observations x along with their class labels.

M.L.E. for Gaussian Parameters

Assuming parametric forms for $p(x|C_k)$, we can determine the values of the parameters and the priors $p(C_k)$ using maximum likelihood. The likelihood is written in terms of the target vector $t = (t_1, \ldots, t_N)^T$. It is convenient to maximize the log of the likelihood.
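The likelihood itself does not survive in the slide text. A sketch of the standard two-class form follows, under these labeled assumptions: $t_n = 1$ denotes class $C_1$ and $t_n = 0$ denotes $C_2$, the prior is written $p(C_1) = \pi$, $X$ denotes the set of inputs $x_1, \ldots, x_N$, and both classes are Gaussian with shared covariance $\Sigma$.

```latex
p(t, X \mid \pi, \mu_1, \mu_2, \Sigma)
  = \prod_{n=1}^{N}
    \left[\pi\,\mathcal{N}(x_n \mid \mu_1, \Sigma)\right]^{t_n}
    \left[(1-\pi)\,\mathcal{N}(x_n \mid \mu_2, \Sigma)\right]^{1-t_n}

\ln p(t, X \mid \pi, \mu_1, \mu_2, \Sigma)
  = \sum_{n=1}^{N} \Big\{
      t_n \left[\ln \pi + \ln \mathcal{N}(x_n \mid \mu_1, \Sigma)\right]
      + (1 - t_n) \left[\ln(1-\pi) + \ln \mathcal{N}(x_n \mid \mu_2, \Sigma)\right]
    \Big\}
```

Taking the log turns the product into a sum in which the terms depending on $\pi$, on $\mu_1$, on $\mu_2$, and on $\Sigma$ can be maximized separately.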

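Max Likelihood for Prior and Means

Estimates for the prior probabilities: the MLE for the prior $p(C_1)$ is the fraction of points in class $C_1$.

Estimates for the class means: the MLE for $\mu_1$ is the mean of all input vectors $x_n$ assigned to class $C_1$, and likewise for $\mu_2$ and class $C_2$.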
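In code, these estimates are a count and two averages. The sketch below assumes the labeling convention $t_n = 1$ for $C_1$ and $t_n = 0$ for $C_2$, so that $\pi = N_1/N$ and $\mu_1 = \frac{1}{N_1}\sum_n t_n x_n$; the tiny data set and the function name are made up for illustration.

```python
def mle_prior_and_means(X, t):
    """Maximum likelihood estimates of p(C1) and the two class means.

    X is a list of input vectors, t a list of labels (1 for C1, 0 for C2).
    """
    n = len(X)
    dim = len(X[0])
    class1 = [x for x, tn in zip(X, t) if tn == 1]
    class2 = [x for x, tn in zip(X, t) if tn == 0]
    pi = len(class1) / n  # fraction of points in C1
    mu1 = [sum(x[d] for x in class1) / len(class1) for d in range(dim)]
    mu2 = [sum(x[d] for x in class2) / len(class2) for d in range(dim)]
    return pi, mu1, mu2

# Made-up labeled data (assumption, for illustration only).
X = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, -2.0], [4.0, 1.0]]
t = [1, 1, 0, 0, 0]

pi, mu1, mu2 = mle_prior_and_means(X, t)
print(pi, mu1, mu2)
```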

Max Likelihood for Covariance Matrix

Solution for the shared covariance matrix: pick out the terms in the log-likelihood function that depend on $\Sigma$. The resulting estimate is a weighted average of the two separate class covariance matrices.
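A minimal sketch of the weighted average, assuming the usual weights $N_k/N$, that is $\Sigma = \frac{N_1}{N}S_1 + \frac{N_2}{N}S_2$ with $S_k = \frac{1}{N_k}\sum_{n \in C_k}(x_n - \mu_k)(x_n - \mu_k)^T$. The data and helper names are hypothetical.

```python
def mean(X):
    dim = len(X[0])
    return [sum(x[d] for x in X) / len(X) for d in range(dim)]

def class_covariance(X):
    """S_k: average outer product of deviations from the class mean."""
    mu = mean(X)
    dim = len(mu)
    S = [[0.0] * dim for _ in range(dim)]
    for x in X:
        dev = [x[i] - mu[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                S[i][j] += dev[i] * dev[j] / len(X)
    return S

def shared_covariance(X1, X2):
    """Weighted average (N1/N) S1 + (N2/N) S2 of the class covariances."""
    n1, n2 = len(X1), len(X2)
    n = n1 + n2
    S1, S2 = class_covariance(X1), class_covariance(X2)
    dim = len(S1)
    return [[(n1 * S1[i][j] + n2 * S2[i][j]) / n for j in range(dim)]
            for i in range(dim)]

# Made-up points for each class (assumption, for illustration only).
X1 = [[1.0, 2.0], [3.0, 4.0]]
X2 = [[0.0, 0.0], [2.0, -2.0], [4.0, 2.0]]

Sigma = shared_covariance(X1, X2)
print(Sigma)
```

Weighting by class size means the larger class contributes more to the shared estimate.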

Discrete Features

Assume binary features $x_i \in \{0, 1\}$. With M inputs, a general distribution is a table of $2^M$ values.

Naive Bayes assumption: the features are independent given the class. The class-conditional distributions then have the form
$$p(x|C_k) = \prod_{i=1}^{M} \mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i}$$

Substituting into the form needed for the normalized exponential, $a_k(x) = \ln[p(x|C_k)\,p(C_k)]$, gives
$$a_k(x) = \sum_{i=1}^{M}\left\{x_i \ln \mu_{ki} + (1-x_i)\ln(1-\mu_{ki})\right\} + \ln p(C_k)$$
which is linear in x. Similar results hold for discrete variables that take more than 2 values.
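Collecting the terms in $x_i$ makes the linearity explicit: $a_k(x) = \sum_i x_i \ln\frac{\mu_{ki}}{1-\mu_{ki}} + \sum_i \ln(1-\mu_{ki}) + \ln p(C_k)$. The sketch below compares this weight-and-bias form with the direct sum for every binary vector of length 3; the parameter values and function names are made up.

```python
import itertools
import math

def a_k_direct(x, mu, prior):
    """a_k(x) as the sum over features plus the log prior."""
    total = sum(xi * math.log(m) + (1 - xi) * math.log(1.0 - m)
                for xi, m in zip(x, mu))
    return total + math.log(prior)

def linear_form(mu, prior):
    """Weights and bias such that a_k(x) = w . x + b."""
    w = [math.log(m / (1.0 - m)) for m in mu]
    b = sum(math.log(1.0 - m) for m in mu) + math.log(prior)
    return w, b

# Made-up Bernoulli parameters mu_ki and prior p(C_k) for one class (assumption).
mu = [0.8, 0.3, 0.55]
prior = 0.25
w, b = linear_form(mu, prior)

# Largest discrepancy between the two forms over all 2^3 binary inputs.
max_gap = max(
    abs(a_k_direct(x, mu, prior) - (sum(wi * xi for wi, xi in zip(w, x)) + b))
    for x in itertools.product([0, 1], repeat=3)
)
print(max_gap)  # tiny, rounding error only
```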

Exponential Family

We have seen that, for both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid (K = 2) or softmax (K ≥ 2) activation functions. These are particular cases of a more general result obtained by assuming that the class-conditional densities $p(x|C_k)$ are members of the exponential family of distributions.

Exponential Family Definition

Class-conditionals that belong to the exponential family have the general form
$$p(x|\lambda_k) = h(x)\,g(\lambda_k)\exp\left\{\lambda_k^T u(x)\right\}$$
where $\lambda_k$ are the natural parameters of the distribution, $u(x)$ is a function of x, and $g(\lambda_k)$ is a coefficient that ensures the distribution is normalized.

Restricting attention to the subclass of such distributions for which $u(x) = x$, and introducing a scaling parameter s, we obtain the form
$$p(x|\lambda_k, s) = \frac{1}{s}\,h\!\left(\frac{1}{s}x\right) g(\lambda_k)\exp\left\{\frac{1}{s}\lambda_k^T x\right\}$$

Note that each class has its own parameter vector $\lambda_k$, but all classes share the scale parameter s.

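Exponential Family: Sigmoidal Form

For the two-class problem, substitute the expressions for the class-conditional densities into
$$a = \ln\frac{p(x|C_1)\,p(C_1)}{p(x|C_2)\,p(C_2)}$$
The posterior probability is then given by a logistic sigmoid acting on a linear function $a(x)$:
$$a(x) = \frac{1}{s}(\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$$
For s = 1, or with the factor 1/s absorbed into $\lambda_k$, the first term is simply $(\lambda_1 - \lambda_2)^T x$.

For the K-class problem, substituting the class-conditional density expression into $a_k = \ln[p(x|C_k)\,p(C_k)]$ and dropping the terms common to all classes, we get
$$a_k(x) = \frac{1}{s}\lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)$$
which is again a linear function of x.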
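The substitution step can be written out as follows. This is a sketch using only the scaled form $p(x|\lambda_k, s)$ with the shared scale parameter s from the definition of the exponential family.

```latex
\begin{aligned}
\ln p(x \mid \lambda_k, s)
  &= -\ln s + \ln h\!\left(\tfrac{1}{s}x\right) + \ln g(\lambda_k)
     + \tfrac{1}{s}\lambda_k^T x \\[4pt]
a(x) &= \ln p(x \mid \lambda_1, s) - \ln p(x \mid \lambda_2, s)
        + \ln p(C_1) - \ln p(C_2) \\
     &= \tfrac{1}{s}(\lambda_1 - \lambda_2)^T x
        + \ln g(\lambda_1) - \ln g(\lambda_2)
        + \ln p(C_1) - \ln p(C_2)
\end{aligned}
```

The terms $-\ln s$ and $\ln h(x/s)$ are identical for both classes because s is shared, so they cancel in the difference. Setting s = 1, or absorbing 1/s into the natural parameters, gives the form $a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$.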

Summary of Probabilistic Linear Classifiers

- Defined using the logistic sigmoid, $p(C_1|x) = \sigma(a)$, where a is the LLR combined with the Bayes (prior) odds, or the softmax function
  $$p(C_k|x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
- In the continuous case with shared covariance, we get linear functions of the input x.
- The discrete case with independent features also results in linear functions.
