Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science, Canberra, February - June 2018
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VII: Linear Classification 1
Classification
Generalised Linear Model
Inference and Decision
Discriminant Functions

Classification
Goal: given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K. Equivalently, divide the input space into different regions.
Figure: length of the petal [in cm] for a given sepal [in cm] for iris flowers (Iris Setosa, Iris Versicolor, Iris Virginica).

How to represent binary class labels?
Class labels are no longer real values as in regression, but a discrete set.
Two classes: $t \in \{0, 1\}$ (t = 1 represents class C_1 and t = 0 represents class C_2).
We can interpret the value of t as the probability of class C_1, with only two possible values for the probability, 0 or 1.
Note: other conventions for mapping classes to integers are possible; check the setup.

How to represent multi-class labels?
If there are more than two classes (K > 2), we call it a multi-class setup.
Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for t_j = 1, where j is the index of the class the input belongs to.
Example: given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector
$t = (0, 1, 0, 0, 0)^T$
Note: other conventions for mapping multi-class labels to integers are possible; check the setup.
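As a minimal sketch of the 1-of-K coding scheme (not part of the original slides; the function name and use of NumPy are my own choices), the following encodes integer class labels as one-hot target vectors:

```python
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1.0
    return T

# Membership in class C_2 (index 1) among K = 5 classes:
print(one_hot([1], K=5))   # [[0. 1. 0. 0. 0.]]
```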

Linear Model
Idea: use a linear model again, as in regression, where y(x, w) is a linear function of the parameters w:
$y(x_n, w) = w^T \phi(x_n)$
But in general $y(x_n, w) \in \mathbb{R}$.
Example: which class is y(x, w) = 0.71623?

Generalised Linear Model
Apply a mapping $f : \mathbb{R} \to \mathbb{Z}$ to the linear model to get discrete class labels:
$y(x_n, w) = f(w^T \phi(x_n))$
f(·) is called the activation function; its inverse f^{-1}(·) is called the link function.
Figure: example of an activation function, f(z) = sign(z).

Three Models for Decision Problems
Discriminant Functions: find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models:
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.
Generative Models:
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.

Decision Theory - Key Ideas
Two classes C_1 and C_2, with joint distribution p(x, C_k); using Bayes' theorem,
$p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(x)}$
Example: cancer treatment (K = 2)
data x: an X-ray image
C_1: patient has cancer (C_2: patient has no cancer)
p(C_1) is the prior probability of a person having cancer
p(C_1 | x) is the posterior probability of a person having cancer after having seen the X-ray data

Decision Theory - Key Ideas
We need a rule which assigns each value of the input x to one of the available classes. The input space is partitioned into decision regions R_k, which leads to decision boundaries or decision surfaces.
Probability of a mistake:
$p(\text{mistake}) = p(x \in \mathcal{R}_1, \mathcal{C}_2) + p(x \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(x, \mathcal{C}_2)\,dx + \int_{\mathcal{R}_2} p(x, \mathcal{C}_1)\,dx$

Decision Theory - Key Ideas
Probability of a mistake:
$p(\text{mistake}) = p(x \in \mathcal{R}_1, \mathcal{C}_2) + p(x \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(x, \mathcal{C}_2)\,dx + \int_{\mathcal{R}_2} p(x, \mathcal{C}_1)\,dx$
Goal: minimise p(mistake).
Figure: joint densities p(x, C_1) and p(x, C_2) over the decision regions R_1 and R_2, illustrating the contributions to the probability of a mistake.

Decision Theory - Key Ideas
For multiple classes, instead of minimising the probability of mistakes, maximise the probability of correct classification:
$p(\text{correct}) = \sum_{k=1}^{K} p(x \in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(x, \mathcal{C}_k)\,dx$

Minimising the Expected Loss
Not all mistakes are equally costly. Weight the misclassification of an input x whose correct class is C_k into the wrong class C_j by a factor L_{kj}. The expected loss is then
$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(x, \mathcal{C}_k)\,dx$
Goal: minimise the expected loss E[L].
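A small sketch (not from the slides; the loss matrix values are made up) of the resulting decision rule: given posteriors p(C_k | x), assign x to the class j that minimises the expected loss sum_k L_{kj} p(C_k | x).

```python
import numpy as np

# Hypothetical 2-class loss matrix L[k, j]: cost of deciding class j when the
# true class is k (missing a cancer is assumed 100x worse than a false alarm).
L = np.array([[0.0, 100.0],
              [1.0, 0.0]])

def decide(posteriors, L):
    """Pick the class j that minimises sum_k L[k, j] * p(C_k | x)."""
    expected_loss = posteriors @ L      # one expected-loss value per candidate class j
    return np.argmin(expected_loss, axis=-1)

p = np.array([0.3, 0.7])                # p(C_1 | x) = 0.3, p(C_2 | x) = 0.7
print(decide(p, L))                     # 0: the asymmetric loss overrides the larger posterior
```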

Two Classes
Definition: a discriminant is a function that maps an input vector x to one of K classes, denoted by C_k.
Consider first two classes (K = 2). Construct a linear function of the inputs x,
$y(x) = w^T x + w_0$
such that x is assigned to class C_1 if y(x) >= 0, and to class C_2 otherwise.
w is the weight vector; w_0 is the bias (w_0 is sometimes called the threshold).

Two Classes
The decision boundary y(x) = 0 is a (D - 1)-dimensional hyperplane in the D-dimensional input space (the decision surface).
w is orthogonal to any vector lying in the decision surface.
Proof: assume x_A and x_B are two points lying in the decision surface. Then
$0 = y(x_A) - y(x_B) = w^T (x_A - x_B)$

Two Classes
The normal distance from the origin to the decision surface is
$\frac{w^T x}{\lVert w \rVert} = -\frac{w_0}{\lVert w \rVert}$
Figure: geometry of a linear discriminant in two dimensions; the decision surface y = 0 separates the regions R_1 (y > 0) and R_2 (y < 0), w is normal to the surface, and y(x)/||w|| is the signed distance of x from it.

Two Classes
The value of y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface: r = y(x)/||w||.
Writing $x = x_\perp + r \frac{w}{\lVert w \rVert}$, where $x_\perp$ is the orthogonal projection of x onto the decision surface (so that $w^T x_\perp + w_0 = 0$), we get
$y(x) = w^T\!\left(x_\perp + r\frac{w}{\lVert w\rVert}\right) + w_0 = \underbrace{w^T x_\perp + w_0}_{=\,0} + r\,\frac{w^T w}{\lVert w\rVert} = r\,\lVert w \rVert$
Figure: same geometry as on the previous slide.
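A minimal sketch of the signed-distance formula r = y(x)/||w||, using a toy hyperplane of my own choosing:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed perpendicular distance r = y(x) / ||w|| of x from the hyperplane w^T x + w0 = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0      # toy hyperplane 3*x1 + 4*x2 - 5 = 0
x = np.array([3.0, 4.0])
print(signed_distance(x, w, w0))        # 4.0: x lies 4 units on the y(x) > 0 (class C_1) side
```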

Two Classes
More compact notation: add an extra dimension to the input space and set its value to x_0 = 1. Define $\tilde{w} = (w_0, w)$ and $\tilde{x} = (1, x)$, so that
$y(x) = \tilde{w}^T \tilde{x}$
The decision surface is now a D-dimensional hyperplane in the (D + 1)-dimensional expanded input space.

Part VIII: Linear Classification 2
Generative Models

Three Models for Decision Problems
In increasing order of complexity:
Discriminant Functions: find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models:
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.
Generative Models:
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.

Generative Models
Generative approach: model the class-conditional densities p(x | C_k) and the priors p(C_k) to calculate the posterior probability for class C_1:
$p(\mathcal{C}_1 \mid x) = \frac{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1) + p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x))$
where a(x) and the logistic sigmoid function σ(a) are given by
$a(x) = \ln \frac{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \ln \frac{p(x, \mathcal{C}_1)}{p(x, \mathcal{C}_2)}, \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}$

Logistic Sigmoid
The logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$:
"Squashing function", because it maps the real axis into the finite interval (0, 1).
Symmetry: $\sigma(-a) = 1 - \sigma(a)$.
Derivative: $\frac{d}{da}\sigma(a) = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a))$.
The inverse is called the logit function: $a(\sigma) = \ln\!\left(\frac{\sigma}{1 - \sigma}\right)$.
Figures: the logistic sigmoid σ(a) and the logit a(σ).
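A small sketch (my own helper functions) that checks the three properties above numerically:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))."""
    return np.log(s / (1.0 - s))

a = 0.7
s = sigmoid(a)
print(np.isclose(logit(s), a))              # True: the logit inverts the sigmoid
print(np.isclose(sigmoid(-a), 1.0 - s))     # True: sigma(-a) = 1 - sigma(a)
# Derivative check against a central finite difference:
eps = 1e-6
num_grad = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(np.isclose(num_grad, s * (1.0 - s)))  # True: d/da sigma = sigma(1 - sigma)
```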

Generative Models - Multiclass
The normalised exponential is given by
$p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(x \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
where $a_k = \ln\big(p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)\big)$.
Also called the softmax function, as it is a smoothed version of the max function.
Example: if $a_k \gg a_j$ for all $j \ne k$, then $p(\mathcal{C}_k \mid x) \approx 1$ and $p(\mathcal{C}_j \mid x) \approx 0$.
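A minimal sketch of the normalised exponential; the max-shift is a standard numerical-stability trick and is my addition, not part of the slides:

```python
import numpy as np

def softmax(a):
    """Normalised exponential p_k = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                  # shifting by a constant leaves the result unchanged
    e = np.exp(a)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # approx. [0.659 0.242 0.099]
print(softmax([10.0, 0.0, 0.0]))     # dominant a_k: probability close to 1 for that class
```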

Probabilistic Generative Model
Assume the class-conditional probabilities are Gaussian and all classes share the same covariance. What can we say about the posterior probabilities?
$p(x \mid \mathcal{C}_k) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}$
$\phantom{p(x \mid \mathcal{C}_k)} = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2} x^T \Sigma^{-1} x \right\} \exp\left\{ \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \right\}$
where we separated the quadratic term in x from the linear term.

Probabilistic Generative Model
For two classes, p(C_1 | x) = σ(a(x)), and a(x) is
$a(x) = \ln \frac{p(x \mid \mathcal{C}_1)\, p(\mathcal{C}_1)}{p(x \mid \mathcal{C}_2)\, p(\mathcal{C}_2)} = \ln \frac{\exp\{\mu_1^T \Sigma^{-1} x - \tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1\}}{\exp\{\mu_2^T \Sigma^{-1} x - \tfrac{1}{2}\mu_2^T \Sigma^{-1}\mu_2\}} + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$
Therefore
$p(\mathcal{C}_1 \mid x) = \sigma(w^T x + w_0)$
where
$w = \Sigma^{-1}(\mu_1 - \mu_2)$
$w_0 = -\tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(\mathcal{C}_1)}{p(\mathcal{C}_2)}$
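A sketch of the shared-covariance result with made-up parameters (means, covariance, and prior are all my own toy values):

```python
import numpy as np

def gaussian_posterior_params(mu1, mu2, Sigma, prior1):
    """w and w0 such that p(C_1 | x) = sigma(w^T x + w0) for shared-covariance Gaussians."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / (1.0 - prior1)))
    return w, w0

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])   # made-up class means
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])                # shared covariance
w, w0 = gaussian_posterior_params(mu1, mu2, Sigma, prior1=0.5)
x = np.array([0.5, 0.2])
posterior = 1.0 / (1.0 + np.exp(-(w @ x + w0)))
print(w, w0, posterior)                                   # posterior > 0.5: x is closer to mu1
```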

Probabilistic Generative Model
Figure: class-conditional densities for two classes (left) and the posterior probability p(C_1 | x) (right). Note that the posterior is a logistic sigmoid of a linear function of x.

General Case - K Classes, Shared Covariance
Use the normalised exponential
$p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(x \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
with $a_k = \ln\big(p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)\big)$ to get a linear function of x,
$a_k(x) = w_k^T x + w_{k0}$
where
$w_k = \Sigma^{-1} \mu_k$
$w_{k0} = -\tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \ln p(\mathcal{C}_k)$

General Case - K Classes, Different Covariance
If each class-conditional probability has a different covariance, the quadratic terms $-\tfrac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel each other out. We get a quadratic discriminant.
Figure: class-conditional Gaussians and the resulting decision boundaries; a class with a different covariance produces a quadratic boundary.

Maximum Likelihood Solution
Given the functional form of the class-conditional densities p(x | C_k), can we determine the parameters µ and Σ?

Maximum Likelihood Solution
Given the functional form of the class-conditional densities p(x | C_k), can we determine the parameters µ and Σ? Not without data ;-)
Given also a data set (x_n, t_n) for n = 1, ..., N, using the coding scheme where t_n = 1 corresponds to class C_1 and t_n = 0 denotes class C_2.
Assume the class-conditional densities to be Gaussian with the same covariance but different means.
Denote the prior probability p(C_1) = π, and therefore p(C_2) = 1 - π. Then
$p(x_n, \mathcal{C}_1) = p(\mathcal{C}_1)\, p(x_n \mid \mathcal{C}_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)$
$p(x_n, \mathcal{C}_2) = p(\mathcal{C}_2)\, p(x_n \mid \mathcal{C}_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)$

Maximum Likelihood Solution
Thus the likelihood for the whole data set X and t is given by
$p(\mathbf{t}, X \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \big[\pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\big]^{t_n} \big[(1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\big]^{1 - t_n}$
Maximise the log likelihood. The term depending on π is
$\sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big)$
which is maximal for
$\pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$
where N_1 is the number of data points in class C_1.

Maximum Likelihood Solution
Similarly, we can maximise the log likelihood (and thereby the likelihood p(t, X | π, µ_1, µ_2, Σ)) with respect to the means µ_1 and µ_2, and get
$\mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n \qquad \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n$
For each class, these are the means of all input vectors assigned to that class.

Maximum Likelihood Solution
Finally, the log likelihood ln p(t, X | π, µ_1, µ_2, Σ) can be maximised with respect to the covariance Σ, resulting in
$\Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad S_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} (x_n - \mu_k)(x_n - \mu_k)^T$
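A sketch putting the three maximum likelihood estimates together on toy data (the data-generating blobs and function name are my own choices):

```python
import numpy as np

def fit_shared_cov_gaussians(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for two Gaussian classes with shared covariance.
    t[n] = 1 marks class C_1, t[n] = 0 marks class C_2."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma

# Toy data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, (50, 2)), rng.normal([-2, -2], 1.0, (70, 2))])
t = np.concatenate([np.ones(50), np.zeros(70)])
print(fit_shared_cov_gaussians(X, t))
```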

Naive Bayes
Assume the input space consists of discrete features, in the simplest case x_i ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries; together with the normalisation constraint, these are 2^D - 1 independent variables, which grows exponentially with the number of features.
The naive Bayes assumption is that all features, conditioned on the class C_k, are independent of each other:
$p(x \mid \mathcal{C}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}\, (1 - \mu_{ki})^{1 - x_i}$

Naive Bayes
With the naive Bayes assumption
$p(x \mid \mathcal{C}_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}\, (1 - \mu_{ki})^{1 - x_i}$
we can again find the factors a_k in the normalised exponential
$p(\mathcal{C}_k \mid x) = \frac{p(x \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{\sum_j p(x \mid \mathcal{C}_j)\, p(\mathcal{C}_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
as a linear function of the x_i:
$a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\} + \ln p(\mathcal{C}_k)$
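A minimal sketch of the naive Bayes posterior for binary features; the Bernoulli parameters and priors are made up for illustration:

```python
import numpy as np

def naive_bayes_posterior(x, mu, priors):
    """Posterior p(C_k | x) for binary features under the naive Bayes assumption.
    mu[k, i] = p(x_i = 1 | C_k); a_k is linear in the features x_i."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    a = a - a.max()                       # stabilise the normalised exponential
    e = np.exp(a)
    return e / e.sum()

mu = np.array([[0.9, 0.2, 0.7],           # made-up Bernoulli parameters for class C_1
               [0.1, 0.8, 0.5]])          # ... and for class C_2
priors = np.array([0.5, 0.5])
x = np.array([1, 0, 1])
print(naive_bayes_posterior(x, mu, priors))   # heavily favours C_1
```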

Three Models for Decision Problems
In increasing order of complexity:
Discriminant Functions: find a discriminant function f(x) which maps each input directly onto a class label.
Discriminative Models:
1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
2. Use decision theory to assign each new x to one of the classes.
Generative Models:
1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
2. Also, infer the prior class probabilities p(C_k).
3. Use Bayes' theorem to find the posterior p(C_k | x).
4. Alternatively, model the joint distribution p(x, C_k) directly.
5. Use decision theory to assign each new x to one of the classes.

Maximise a likelihood function defined through the conditional distribution p(C_k | x) directly: discriminative training.
Typically fewer parameters need to be determined.
As we learn the posterior p(C_k | x) directly, prediction may be better than with a generative model whose class-conditional density assumptions p(x | C_k) poorly approximate the true distributions.
But: discriminative models cannot create synthetic data, as p(x) is not modelled.

Original Input versus Feature Space
We have used the direct input x until now. All classification algorithms work equally well if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centred at the green crosses in the input space.
Figure: the original input space (x_1, x_2) and the corresponding feature space (φ_1, φ_2).

Original Input versus Feature Space
Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: if classes overlap in input space, they will also overlap in feature space. Nonlinear features φ(x) cannot remove the overlap, but they may increase it!
Figure: decision boundary in the input space (x_1, x_2) and in the feature space (φ_1, φ_2).

Original Input versus Feature Space
Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion in Linear Regression).
Understanding more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use φ instead of x from now on.

Logistic Regression is Classification
Two classes, where the posterior of class C_1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ:
$p(\mathcal{C}_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad p(\mathcal{C}_2 \mid \phi) = 1 - p(\mathcal{C}_1 \mid \phi)$
The model dimension is equal to the dimension M of the feature space.
Compare this to fitting two Gaussians: $\underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}} = M(M+5)/2$ parameters.
For larger M, the logistic regression model has a clear advantage.

Logistic Regression is Classification
Determine the parameters via maximum likelihood for data (φ_n, t_n), n = 1, ..., N, where φ_n = φ(x_n) and the class membership is coded as t_n ∈ {0, 1}.
Likelihood function:
$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}, \qquad y_n = p(\mathcal{C}_1 \mid \phi_n)$
Error function: the negative log likelihood, resulting in the cross-entropy error function
$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}$

Logistic Regression is Classification
Error function (cross-entropy error):
$E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}, \qquad y_n = p(\mathcal{C}_1 \mid \phi_n) = \sigma(w^T \phi_n)$
Gradient of the error function (using $\frac{d\sigma}{da} = \sigma(1 - \sigma)$):
$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n$
The gradient does not contain any sigmoid function; for each data point, the error contribution is the product of the deviation y_n - t_n and the basis function φ_n.
BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error or MAP.
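A sketch of batch gradient descent on the cross-entropy error, using the gradient above; the toy data, learning rate, and iteration count are my own choices, not from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, lr=0.1, iters=2000):
    """Minimise the cross-entropy error E(w) by batch gradient descent.
    Gradient: sum_n (y_n - t_n) * phi_n, as derived on the slide above."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)
        w -= lr * grad / len(t)          # average gradient for a stable step size
    return w

# Toy 1-D problem with a bias feature phi = (1, x).
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 200)
t = (x + 0.3 * rng.normal(size=200) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
w = fit_logistic_regression(Phi, t)
print(w)                                 # decision boundary near x = -w[0]/w[1], close to 0
```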

The Laplace Approximation
Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?
We need to find a mode of p(x) and try to find a Gaussian with the same mode.
Figures: a non-Gaussian density (yellow) and its Gaussian approximation (red); the negative log of the non-Gaussian (yellow) and of the Gaussian approximation (red).

The Laplace Approximation
Assume p(z) can be written as $p(z) = \frac{1}{Z} f(z)$ with normalisation $Z = \int f(z)\,dz$. Furthermore, assume Z is unknown!
A mode of p(z) is a point z_0 where p'(z_0) = 0.
Taylor expansion of ln f(z) around z_0:
$\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2$
where $A = -\left.\dfrac{d^2}{dz^2} \ln f(z)\right|_{z = z_0}$

The Laplace Approximation
Exponentiating
$\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2$
we get
$f(z) \simeq f(z_0) \exp\left\{ -\tfrac{A}{2} (z - z_0)^2 \right\}$
and after normalisation we get the Laplace approximation
$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\tfrac{A}{2} (z - z_0)^2 \right\}$
This is only defined for precision A > 0, as only then does p(z) have a maximum at z_0.
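A small numerical sketch of the one-dimensional Laplace approximation; the unnormalised density, the grid search for the mode, and the finite-difference estimate of A are all my own illustrative choices:

```python
import numpy as np

# Unnormalised, non-Gaussian density: f(z) = z^3 * exp(-z) for z > 0 (a Gamma shape).
def log_f(z):
    return 3.0 * np.log(z) - z

# 1. Find the mode z0 of ln f by a crude grid search (analytically z0 = 3 here).
zs = np.linspace(0.01, 20.0, 200001)
z0 = zs[np.argmax(log_f(zs))]

# 2. Precision A = -(d^2/dz^2) ln f at z0, via a central finite difference.
eps = 1e-4
A = -(log_f(z0 + eps) - 2 * log_f(z0) + log_f(z0 - eps)) / eps**2

# 3. Laplace approximation q(z) = sqrt(A / 2 pi) * exp(-A/2 * (z - z0)^2).
def q(z):
    return np.sqrt(A / (2 * np.pi)) * np.exp(-0.5 * A * (z - z0) ** 2)

print(z0, A)        # approximately 3.0 and 1/3, i.e. q is N(z | 3, variance 3)
print(q(z0))        # peak height of the Gaussian approximation
```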

The Laplace Approximation - Vector Space
Approximate p(z) for $z \in \mathbb{R}^M$, where $p(z) = \frac{1}{Z} f(z)$.
We get the Taylor expansion
$\ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} (z - z_0)^T A (z - z_0)$
where the Hessian A is defined as $A = -\nabla\nabla \ln f(z)\big|_{z = z_0}$.
The Laplace approximation of p(z) is then
$q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\tfrac{1}{2} (z - z_0)^T A (z - z_0) \right\} = \mathcal{N}(z \mid z_0, A^{-1})$

Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable. Why? We would need to normalise a product of prior probabilities and likelihoods, which itself is a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.

Bayesian Logistic Regression
Assume a Gaussian prior (because we want a Gaussian posterior):
$p(w) = \mathcal{N}(w \mid m_0, S_0)$
for fixed hyperparameters m_0 and S_0. Hyperparameters are parameters of a prior distribution; in contrast to the model parameters w, they are not learned.
For a set of training data (x_n, t_n), n = 1, ..., N, the posterior is given by
$p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)$
where $\mathbf{t} = (t_1, \ldots, t_N)^T$.

Bayesian Logistic Regression
Using our previous result for the cross-entropy function
$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\}$
we can now calculate the log of the posterior $p(w \mid \mathbf{t}) \propto p(w)\, p(\mathbf{t} \mid w)$, using the notation $y_n = \sigma(w^T \phi_n)$, as
$\ln p(w \mid \mathbf{t}) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const.}$

Bayesian Logistic Regression
To obtain a Gaussian approximation to
$\ln p(w \mid \mathbf{t}) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} + \text{const.}$
1. Find w_MAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because y_n = σ(w^T φ_n).)
2. Calculate the second derivative of the negative log likelihood to get the inverse covariance of the Laplace approximation:
$S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T$

Bayesian Logistic Regression
The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now
$q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N)$
where
$S_N^{-1} = -\nabla\nabla \ln p(w \mid \mathbf{t}) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T$
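A sketch of the whole construction on toy data, assuming a simple gradient-ascent search for w_MAP (the prior, data, learning rate, and iteration count are my own illustrative choices, not a prescription from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, lr=0.1, iters=5000):
    """Laplace approximation q(w) = N(w | w_MAP, S_N) for Bayesian logistic regression.
    Finds w_MAP by gradient ascent on ln p(w | t), then builds S_N^{-1}."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = -S0_inv @ (w - m0) + Phi.T @ (t - y)   # gradient of the log posterior
        w += lr * grad / len(t)
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(SN_inv)

# Toy data with a bias feature and a Gaussian prior N(0, 10 I).
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 100)
t = (x + 0.5 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
w_map, S_N = laplace_posterior(Phi, t, m0=np.zeros(2), S0=10.0 * np.eye(2))
print(w_map)          # posterior mode w_MAP
print(S_N)            # posterior covariance of the Laplace approximation
```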