Ch 4. Linear Models for Classification
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Department of Computer Science and Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea. dkim@postech.ac.kr
Contents
4.1 Discriminant Functions
4.2 Probabilistic Generative Models
4.3 Probabilistic Discriminative Models
4.4 The Laplace Approximation
4.5 Bayesian Logistic Regression
Classification Models
Linear classification model: a (D-1)-dimensional hyperplane in a D-dimensional input space; a 1-of-K coding scheme for K > 2 classes, e.g. t = (0, 1, 0, 0, 0)^T.
Discriminant function: directly assigns each vector x to a specific class, e.g. Fisher's linear discriminant.
Approaches using the conditional probability p(C_k|x): separation of the inference and decision stages. Two approaches:
Direct modeling of the posterior probability (discriminative approach).
Generative approach: model the likelihood and the prior probability and use them to calculate the posterior; capable of generating samples.
Discriminant Functions - Two Classes
Classification by hyperplanes:
y(x) = w^T x + w_0; if y(x) ≥ 0 then x ∈ C_1, otherwise x ∈ C_2,
or equivalently y(x) = w̃^T x̃, where w̃ = (w_0, w) and x̃ = (1, x).
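A minimal sketch of this two-class rule, assuming the weight vector and bias are already given (the numerical values below are illustrative placeholders, not fitted parameters):

```python
# Two-class linear discriminant y(x) = w^T x + w_0.
import numpy as np

def predict_class(x, w, w0):
    """Assign x to C1 if y(x) >= 0, otherwise to C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

w = np.array([1.0, -2.0])   # hypothetical weight vector (normal to the hyperplane)
w0 = 0.5                    # hypothetical bias; -w0/||w|| is the plane's distance from the origin
print(predict_class(np.array([2.0, 0.5]), w, w0))  # -> C1
```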
Discriminant Functions - Multiple Classes
One-versus-the-rest classifier: K-1 binary classifiers for a K-class discriminant; ambiguous in regions where more than one classifier says yes (or none does).
One-versus-one classifier: K(K-1)/2 binary discriminant functions combined by majority voting; still ambiguous in regions with tied votes.
[Figure: ambiguous regions for the one-versus-the-rest and one-versus-one constructions]
Discriminant Functions - Multiple Classes (Cont'd)
K-class discriminant comprising K linear functions: y_k(x) = w_k^T x + w_{k0}, k = 1, ..., K.
Assigns x to the class having the maximum output: x ∈ C_k if y_k(x) > y_j(x) for all j ≠ k.
The decision regions are always singly connected and convex:
For x_A, x_B ∈ R_k, let x̂ = λx_A + (1-λ)x_B with 0 ≤ λ ≤ 1. By linearity, y_k(x̂) = λy_k(x_A) + (1-λ)y_k(x_B). Since y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B) for all j ≠ k, it follows that y_k(x̂) > y_j(x̂), so x̂ ∈ R_k.
Approaches for Learning the Parameters of Linear Discriminant Functions
Least squares method
Fisher's linear discriminant (relation to least squares; multiple classes)
Perceptron algorithm
Least Squares Method
Minimization of the sum-of-squares error (SSE) with a 1-of-K binary coding scheme for the target vector t.
y(x) = W̃^T x̃, where W̃ = (w̃_1, w̃_2, ..., w̃_K) and w̃_k = (w_{k0}, w_k^T)^T.
For a training data set {x_n, t_n}, n = 1, ..., N, the sum-of-squares error function is
E_D(W̃) = (1/2) Tr{(X̃W̃ - T)^T (X̃W̃ - T)},
where X̃ = (x̃_1, ..., x̃_N)^T and T = (t_1, ..., t_N)^T.
Minimizing the SSE gives W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T (pseudo-inverse).
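A minimal sketch of the pseudo-inverse solution on synthetic two-class data (the Gaussian clusters and seed are arbitrary choices for illustration):

```python
# Least-squares classification with 1-of-K targets: W = pinv(X~) T.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
T = np.eye(2)[labels]                                # 1-of-K target matrix, shape (N, K)

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias term
W = np.linalg.pinv(X_tilde) @ T                      # pseudo-inverse solution

Y = X_tilde @ W                                      # model outputs y(x) = W~^T x~
pred = Y.argmax(axis=1)                              # assign to the class with the largest output
print("training accuracy:", (pred == labels).mean())
```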
Least Squares Method (Cont'd) - Limitations
The least-squares solution yields y(x) whose elements sum to 1, but the outputs are not constrained to lie in the range [0, 1].
Vulnerable to outliers, because the SSE function also penalizes predictions that are "too correct", i.e. lie far from the decision boundary on the correct side.
Corresponds to ML under a Gaussian conditional distribution, whereas binary target vectors are multimodal rather than unimodal.
Least Squares Method (Cont'd) - Limitations
The lack of robustness comes from the fact that the least squares method corresponds to maximum likelihood under the assumption of a Gaussian distribution, and binary target vectors are far from this assumption.
[Figure: decision boundaries obtained by the least squares solution vs. logistic regression]
Fisher's Linear Discriminant
Linear classification model viewed as dimensionality reduction from the D-dimensional space to one dimension.
In the case of two classes: y = w^T x; if y ≥ -w_0 then x ∈ C_1, otherwise x ∈ C_2.
Find w such that the projected data are well separated by class.
Fisher's Linear Discriminant (Cont'd)
Maximizing the projected mean distance? The distance between the class means m_1 and m_2 projected onto w:
m_2 - m_1 = w^T(m_2 - m_1), where m_1 = (1/N_1) Σ_{n∈C_1} x_n and m_2 = (1/N_2) Σ_{n∈C_2} x_n.
Not appropriate when the class covariances are strongly non-diagonal: the projected classes may still overlap considerably.
Fisher's Linear Discriminant (Cont'd)
Also take the within-class variance of the projected data into account: find the w that maximizes
J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2), where s_k^2 = Σ_{n∈C_k} (y_n - m_k)^2,
or equivalently J(w) = (w^T S_B w) / (w^T S_W w), with
S_B = (m_2 - m_1)(m_2 - m_1)^T (between-class covariance matrix),
S_W = Σ_{n∈C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n∈C_2} (x_n - m_2)(x_n - m_2)^T (within-class covariance matrix).
J(w) is maximized when (w^T S_B w) S_W w = (w^T S_W w) S_B w; since S_B w is always in the direction of (m_2 - m_1), this gives Fisher's linear discriminant
w ∝ S_W^{-1}(m_2 - m_1).
If the within-class covariance is isotropic, w is proportional to the difference of the class means, as in the previous case.
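A minimal sketch of the two-class Fisher direction w ∝ S_W^{-1}(m_2 - m_1), using synthetic class samples (means and spreads chosen arbitrarily):

```python
# Fisher's linear discriminant for two classes.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], [1.0, 3.0], (100, 2))   # class C1
X2 = rng.normal([4, 1], [1.0, 3.0], (100, 2))   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance
w = np.linalg.solve(S_W, m2 - m1)                          # Fisher direction
w /= np.linalg.norm(w)

# Project the data onto w; a threshold on y = w^T x then separates the classes.
y1, y2 = X1 @ w, X2 @ w
print("projected class means:", y1.mean(), y2.mean())
```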
Fisher's Linear Discriminant - Relation to Least Squares
The Fisher criterion arises as a special case of least squares when the target values are set to N/N_1 for class C_1 and -N/N_2 for class C_2:
E = (1/2) Σ_{n=1}^N (w^T x_n + w_0 - t_n)^2
dE/dw_0 = Σ_{n=1}^N (w^T x_n + w_0 - t_n) = 0   (1)
dE/dw = Σ_{n=1}^N (w^T x_n + w_0 - t_n) x_n = 0   (2)
Solving (1) gives w_0 = -w^T m, where m = (1/N) Σ_n x_n = (N_1 m_1 + N_2 m_2)/N.
Substituting this w_0 into (2) gives (S_W + (N_1 N_2 / N) S_B) w = N(m_1 - m_2), and since S_B w is always in the direction of (m_2 - m_1),
w ∝ S_W^{-1}(m_2 - m_1).
Fisher's Discriminant for Multiple Classes
K > 2 classes: dimensionality reduction from D to D' > 1 linear features, y_k = w_k^T x, k = 1, ..., D'.
Generalization of S_W and S_B:
S_W = Σ_{k=1}^K S_k, where S_k = Σ_{n∈C_k} (x_n - m_k)(x_n - m_k)^T and m_k = (1/N_k) Σ_{n∈C_k} x_n.
Total covariance S_T = Σ_{n=1}^N (x_n - m)(x_n - m)^T, where m = (1/N) Σ_{n=1}^N x_n = (1/N) Σ_{k=1}^K N_k m_k.
Decomposing S_T = S_W + S_B gives S_B = Σ_{k=1}^K N_k (m_k - m)(m_k - m)^T.
S_B comes from this decomposition of the total covariance matrix (Duda and Hart, 1997).
Fisher's Discriminant for Multiple Classes (Cont'd)
Covariance matrices in the projected y-space:
s_W = Σ_{k=1}^K Σ_{n∈C_k} (y_n - μ_k)(y_n - μ_k)^T and s_B = Σ_{k=1}^K N_k (μ_k - μ)(μ_k - μ)^T,
where μ_k = (1/N_k) Σ_{n∈C_k} y_n and μ = (1/N) Σ_{k=1}^K N_k μ_k.
Fukunaga's criterion: J(W) = Tr{s_W^{-1} s_B} = Tr{(W S_W W^T)^{-1} (W S_B W^T)}.
Another criterion (Duda et al., Pattern Classification, Ch. 3.8.3):
J(W) = |s_B| / |s_W| = |W S_B W^T| / |W S_W W^T|.
Determinant: the product of the eigenvalues, i.e. the variances in the principal directions.
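A minimal sketch of the multi-class case, taking the projection directions as the leading eigenvectors of S_W^{-1} S_B (one standard way to maximize the trace criterion); the three synthetic class means are arbitrary:

```python
# Fisher's discriminant for K > 2 classes via eigenvectors of S_W^{-1} S_B.
import numpy as np

rng = np.random.default_rng(2)
means = [np.array([0, 0, 0]), np.array([4, 0, 1]), np.array([0, 4, -1])]
X = np.vstack([rng.normal(m, 1.0, (60, 3)) for m in means])
labels = np.repeat(np.arange(3), 60)

m = X.mean(axis=0)
S_W = np.zeros((3, 3))
S_B = np.zeros((3, 3))
for k in range(3):
    Xk = X[labels == k]
    mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)                 # within-class scatter
    S_B += len(Xk) * np.outer(mk - m, mk - m)      # between-class scatter

# At most K - 1 = 2 eigenvalues are non-zero, so D' <= K - 1 useful directions.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs.real[:, order[:2]]                     # projection matrix, D x D'
Y = X @ W                                          # projected features
print("projected shape:", Y.shape)
```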
Fisher's Discriminant for Multiple Classes (Cont'd)
Perceptron Algorithm
Classification of x by a perceptron: y(x) = f(w^T φ(x)), where f(a) = +1 for a ≥ 0 and f(a) = -1 for a < 0.
Error functions: the total number of misclassified patterns is piecewise constant and discontinuous, so its gradient is zero almost everywhere.
Perceptron criterion: E_P(w) = -Σ_{n∈M} w^T φ_n t_n, where t_n ∈ {+1, -1} is the target output and M is the set of misclassified patterns.
Perceptron Algorithm (Cont'd)
Stochastic gradient descent: w^{(τ+1)} = w^{(τ)} - η ∇E_P(w) = w^{(τ)} + η φ_n t_n.
The error from a misclassified pattern is reduced after each update:
-w^{(τ+1)T} φ_n t_n = -w^{(τ)T} φ_n t_n - (φ_n t_n)^T φ_n t_n < -w^{(τ)T} φ_n t_n,
but this does not imply that the overall error is reduced.
Perceptron convergence theorem: if an exact solution exists (i.e. the data are linearly separable), the perceptron learning algorithm is guaranteed to find it in a finite number of steps.
Remaining issues: learning speed, linearly non-separable data, and multiple classes.
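A minimal sketch of the perceptron update rule on synthetic, linearly separable data (so convergence is guaranteed by the theorem above); the cluster locations and learning rate are arbitrary:

```python
# Perceptron learning: w <- w + eta * phi_n * t_n for each misclassified pattern.
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)), rng.normal([3, 3], 0.5, (30, 2))])
t = np.array([-1] * 30 + [1] * 30)              # targets in {-1, +1}
Phi = np.hstack([np.ones((60, 1)), X])          # fixed basis: bias + identity features

w = np.zeros(3)
eta = 1.0
for epoch in range(100):
    misclassified = 0
    for phi_n, t_n in zip(Phi, t):
        if np.sign(w @ phi_n) * t_n <= 0:       # pattern on the wrong side
            w += eta * phi_n * t_n              # perceptron update
            misclassified += 1
    if misclassified == 0:                      # exact solution found
        break
print("converged after", epoch + 1, "epochs, w =", w)
```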
Perceptron Algorithm (Cont'd)
[Figure: panels (a)-(d), successive updates of the perceptron decision boundary]
Probabilistic Generative Models
Computation of posterior probabilities using class-conditional densities p(x|C_k) and class priors p(C_k).
Two classes:
p(C_1|x) = p(x|C_1)p(C_1) / (p(x|C_1)p(C_1) + p(x|C_2)p(C_2)) = 1/(1 + exp(-a)) = σ(a),
where a = ln [p(x|C_1)p(C_1) / (p(x|C_2)p(C_2))].
Generalization to K > 2 classes:
p(C_k|x) = p(x|C_k)p(C_k) / Σ_j p(x|C_j)p(C_j) = exp(a_k) / Σ_j exp(a_j), where a_k = ln p(x|C_k)p(C_k).
The normalized exponential is also known as the softmax function, i.e. a smoothed version of the max function.
Probabilistic Generative Models - Continuous Inputs
Posterior probabilities when the class-conditional densities are Gaussian with a shared covariance matrix:
p(x|C_k) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{-(1/2)(x - μ_k)^T Σ^{-1} (x - μ_k)}.
Two classes: p(C_1|x) = σ(w^T x + w_0), where
w = Σ^{-1}(μ_1 - μ_2) and w_0 = -(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln(p(C_1)/p(C_2)).
The quadratic terms in x from the exponents cancel, so the resulting decision boundary is linear in input space.
The prior only shifts the decision boundary, i.e. the posterior contours stay parallel.
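A minimal sketch of this two-class posterior, assuming the class means, shared covariance, and priors are already known (the values below are illustrative, not estimated):

```python
# Posterior p(C1|x) = sigma(w^T x + w0) from shared-covariance Gaussian class-conditionals.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu1 = np.array([1.0, 1.0])          # assumed class means
mu2 = np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3],       # assumed shared covariance
                  [0.3, 1.0]])
prior1, prior2 = 0.6, 0.4           # assumed class priors

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(prior1 / prior2))

x = np.array([0.5, -0.2])
print("p(C1|x) =", sigmoid(w @ x + w0))
```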
Probabilistic Generative Models - Continuous Inputs (Cont'd)
Generalization to K classes: a_k(x) = w_k^T x + w_{k0}, where
w_k = Σ^{-1} μ_k and w_{k0} = -(1/2) μ_k^T Σ^{-1} μ_k + ln p(C_k).
When the classes share the same covariance matrix, the decision boundaries are again linear.
If each class-conditional density has its own covariance matrix, we obtain quadratic functions of x, giving rise to a quadratic discriminant.
Probabilistic Generative Models - Maximum Likelihood Solution
Determining the parameters of p(x|C_k) and p(C_k) by maximum likelihood from a training data set.
Two classes: data set {x_n, t_n}, n = 1, ..., N, with t_n = 1 denoting class C_1 and t_n = 0 denoting class C_2.
Priors: p(C_1) = π and p(C_2) = 1 - π.
Joint densities: p(x_n, C_1) = p(C_1)p(x_n|C_1) = π N(x_n|μ_1, Σ) and p(x_n, C_2) = (1 - π) N(x_n|μ_2, Σ).
The likelihood function:
p(t|π, μ_1, μ_2, Σ) = Π_{n=1}^N [π N(x_n|μ_1, Σ)]^{t_n} [(1 - π) N(x_n|μ_2, Σ)]^{1 - t_n}, where t = (t_1, ..., t_N)^T.
Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)
Two classes (cont'd):
Maximization with respect to π: the terms of the log likelihood that depend on π are Σ_{n=1}^N {t_n ln π + (1 - t_n) ln(1 - π)}. Setting the derivative with respect to π to zero gives
π = (1/N) Σ_{n=1}^N t_n = N_1/N = N_1/(N_1 + N_2).
Maximization with respect to μ_1: the relevant terms are Σ_{n=1}^N t_n ln N(x_n|μ_1, Σ) = -(1/2) Σ_{n=1}^N t_n (x_n - μ_1)^T Σ^{-1}(x_n - μ_1) + const, giving
μ_1 = (1/N_1) Σ_{n=1}^N t_n x_n, and analogously μ_2 = (1/N_2) Σ_{n=1}^N (1 - t_n) x_n.
Probabilistic Generative Models - Maximum Likelihood Solution (Cont'd)
Two classes (cont'd):
Maximization with respect to the shared covariance matrix Σ: the relevant terms of the log likelihood are
-(1/2) Σ_n t_n ln|Σ| - (1/2) Σ_n t_n (x_n - μ_1)^T Σ^{-1}(x_n - μ_1) - (1/2) Σ_n (1 - t_n) ln|Σ| - (1/2) Σ_n (1 - t_n)(x_n - μ_2)^T Σ^{-1}(x_n - μ_2)
= -(N/2) ln|Σ| - (N/2) Tr{Σ^{-1} S}.
Maximizing gives Σ = S = (N_1/N) S_1 + (N_2/N) S_2, where S_k = (1/N_k) Σ_{n∈C_k} (x_n - μ_k)(x_n - μ_k)^T.
A weighted average of the covariance matrices associated with each class, but not robust to outliers.
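A minimal sketch of these maximum-likelihood estimates (π = N_1/N, the class means, and the pooled covariance), run on synthetic data chosen only for illustration:

```python
# ML estimates for the shared-covariance Gaussian generative model (two classes).
import numpy as np

def fit_gaussian_generative(X, t):
    """X: (N, D) inputs; t: 0/1 labels, where 1 denotes class C1."""
    N = len(t)
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / N                                    # prior p(C1)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2          # weighted average of class covariances
    return pi, mu1, mu2, Sigma

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([2, 0], 1, (70, 2)), rng.normal([-2, 0], 1, (30, 2))])
t = np.array([1] * 70 + [0] * 30)
print(fit_gaussian_generative(X, t))
```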
Probabilistic Generative Models - Discrete Features
Discrete feature values x_i ∈ {0, 1}: a general distribution would correspond to a table of size 2^D per class; with D inputs, the table size grows exponentially with the number of features.
Naive Bayes assumption (features independent, conditioned on the class C_k):
p(x|C_k) = Π_{i=1}^D μ_{ki}^{x_i} (1 - μ_{ki})^{1 - x_i}
a_k(x) = ln p(x|C_k)p(C_k) = Σ_{i=1}^D {x_i ln μ_{ki} + (1 - x_i) ln(1 - μ_{ki})} + ln p(C_k)
Linear with respect to the features, as in the continuous case.
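A minimal sketch of the naive Bayes scores a_k(x) for binary features, assuming the per-class probabilities μ_{ki} and the priors are given (the values below are placeholders):

```python
# Naive Bayes with binary features: a_k(x) is linear in x.
import numpy as np

mu = np.array([[0.8, 0.1, 0.6],     # assumed mu_ki = p(x_i = 1 | C_k), one row per class
               [0.2, 0.7, 0.5]])
log_prior = np.log(np.array([0.5, 0.5]))

def class_log_scores(x, mu, log_prior):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)."""
    return (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + log_prior

x = np.array([1, 0, 1])
a = class_log_scores(x, mu, log_prior)
posterior = np.exp(a - a.max())
posterior /= posterior.sum()        # softmax over the class scores
print("p(C_k|x) =", posterior)
```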
Bayes Decision Boundaries: 2D
[Figure: two-dimensional Bayes decision boundaries; Pattern Classification, Duda et al., p. 42]
Bayes Decision Boundaries: 3D
[Figure: three-dimensional Bayes decision boundaries; Pattern Classification, Duda et al., p. 43]
Probabilistic Generative Models - Exponential Family
For both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.
Generalization: class-conditional densities from the exponential family, restricted to the subclass for which u(x) = x:
p(x|λ_k) = h(x) g(λ_k) exp{λ_k^T u(x)}.
Introducing a scaling parameter s:
p(x|λ_k, s) = (1/s) h(x/s) g(λ_k) exp{(1/s) λ_k^T x}.
Two classes: p(C_1|x) = σ(a(x)), where
a(x) = (λ_1 - λ_2)^T x + ln g(λ_1) - ln g(λ_2) + ln p(C_1) - ln p(C_2).
K classes: a_k(x) = λ_k^T x + ln g(λ_k) + ln p(C_k), with p(C_k|x) = exp(a_k) / Σ_j exp(a_j).
Linear with respect to x again.
3 Approaches for Classification
Discriminant functions.
Probabilistic generative models: fit class-conditional densities and class priors separately, then apply Bayes' theorem to find the posterior class probabilities. The posterior probability of a class can be written as a logistic sigmoid acting on a linear function of x (two classes) or a softmax transformation of a linear function of x (multiclass). The parameters of the densities as well as the class priors can be determined by maximum likelihood.
Probabilistic discriminative models: use the functional form of the generalized linear model explicitly and determine its parameters directly by maximum likelihood.
Fixed basis functions
Assume a fixed nonlinear transformation: transform the inputs using a vector of basis functions φ(x).
The resulting decision boundaries are linear in the feature space, which can correspond to nonlinear boundaries in the original input space.
Logistic regression
Logistic regression model: the posterior probability of a class for the two-class problem is p(C_1|φ) = y(φ) = σ(w^T φ).
Number of adjustable parameters (M-dimensional feature space, two classes):
Gaussian class-conditional densities (generative model): 2M parameters for the means and M(M+1)/2 for the shared covariance matrix; grows quadratically with M.
Logistic regression (discriminative model): M parameters for w; grows linearly with M.
Logistic regression (Cont'd)
Determining the parameters using ML.
Likelihood function: p(t|w) = Π_{n=1}^N y_n^{t_n} (1 - y_n)^{1 - t_n}, where y_n = σ(w^T φ_n).
Cross-entropy error function (negative log likelihood): E(w) = -Σ_{n=1}^N {t_n ln y_n + (1 - t_n) ln(1 - y_n)}.
The gradient of the error function w.r.t. w (the same form as in the linear regression model): ∇E(w) = Σ_{n=1}^N (y_n - t_n) φ_n.
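A minimal sketch of ML training by plain gradient descent on the cross-entropy error, using the gradient Φ^T(y - t); the synthetic data, learning rate, and iteration count are arbitrary illustrative choices:

```python
# Logistic regression trained by gradient descent on the cross-entropy error.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
t = np.array([0] * 50 + [1] * 50)
Phi = np.hstack([np.ones((100, 1)), X])        # basis functions: bias + raw inputs

w = np.zeros(Phi.shape[1])
eta = 0.1
for _ in range(500):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t)                     # same form as in linear regression
    w -= eta * grad / len(t)
print("w =", w, "accuracy =", ((sigmoid(Phi @ w) > 0.5) == t).mean())
```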
Iterative reweighted least squares
Linear regression models in Ch. 3: the ML solution under the assumption of Gaussian noise has a closed form, as a consequence of the quadratic dependence of the log likelihood on the parameters w.
Logistic regression model: no longer a closed-form solution, but the error function is convex with a unique minimum, so an efficient iterative technique can be used.
The Newton-Raphson update to minimize a function E(w): w^{new} = w^{old} - H^{-1} ∇E(w), where H is the Hessian matrix of second derivatives of E(w).
Iterative reweighted least squares (Cont'd)
Sum-of-squares error function: ∇E(w) = Φ^T Φ w - Φ^T t, H = Φ^T Φ.
Newton-Raphson update: w^{new} = w^{old} - (Φ^T Φ)^{-1}(Φ^T Φ w^{old} - Φ^T t) = (Φ^T Φ)^{-1} Φ^T t, i.e. the standard least-squares solution in a single step.
Cross-entropy error function: ∇E(w) = Φ^T(y - t), H = Φ^T R Φ, where R is diagonal with R_{nn} = y_n(1 - y_n).
Newton-Raphson update: w^{new} = w^{old} - (Φ^T R Φ)^{-1} Φ^T (y - t) = (Φ^T R Φ)^{-1} Φ^T R z, where z = Φ w^{old} - R^{-1}(y - t) (iterative reweighted least squares).
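A minimal sketch of the IRLS update w^{new} = (Φ^T R Φ)^{-1} Φ^T R z, with the same data layout as the gradient-descent sketch above; the clipping of y is a small numerical safeguard, not part of the algorithm itself:

```python
# IRLS (Newton-Raphson) for logistic regression.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=10):
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = np.clip(sigmoid(Phi @ w), 1e-10, 1 - 1e-10)
        r = y * (1 - y)                            # diagonal of the weighting matrix R
        z = Phi @ w - (y - t) / r                  # working target z = Phi w - R^{-1}(y - t)
        R = np.diag(r)
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w

rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
t = np.array([0.0] * 50 + [1.0] * 50)
Phi = np.hstack([np.ones((100, 1)), X])
print("IRLS solution:", irls(Phi, t))
```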
Multiclass logistic regression
Posterior probability for multiclass classification: p(C_k|φ) = y_k(φ) = exp(a_k) / Σ_j exp(a_j), with a_k = w_k^T φ.
We can use ML to determine the parameters {w_k} directly.
Likelihood function using the 1-of-K coding scheme: p(T|w_1, ..., w_K) = Π_{n=1}^N Π_{k=1}^K y_{nk}^{t_{nk}}.
Cross-entropy error function for multiclass classification: E(w_1, ..., w_K) = -Σ_{n=1}^N Σ_{k=1}^K t_{nk} ln y_{nk}.
Multiclass logistic regression (Cont'd)
The derivative of the error function: ∇_{w_j} E = Σ_{n=1}^N (y_{nj} - t_{nj}) φ_n, i.e. the same form, the product of the error times the basis function.
The Hessian matrix has blocks ∇_{w_k} ∇_{w_j} E = Σ_{n=1}^N y_{nk}(I_{kj} - y_{nj}) φ_n φ_n^T.
The IRLS algorithm can also be used for batch processing.
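A minimal sketch of batch gradient descent for the multiclass model using the gradient Σ_n (y_{nj} - t_{nj}) φ_n (a simpler alternative to the Newton/IRLS step); the three synthetic classes and step size are arbitrary:

```python
# Multiclass logistic regression via batch gradient descent on the cross-entropy.
import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)     # stabilize the exponentials
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(7)
means = [[0, 0], [3, 0], [0, 3]]
X = np.vstack([rng.normal(m, 1.0, (40, 2)) for m in means])
labels = np.repeat(np.arange(3), 40)
T = np.eye(3)[labels]                        # 1-of-K targets
Phi = np.hstack([np.ones((120, 1)), X])

W = np.zeros((Phi.shape[1], 3))              # one weight vector w_k per column
eta = 0.1
for _ in range(500):
    Y = softmax(Phi @ W)
    W -= eta * Phi.T @ (Y - T) / len(labels) # error times basis function, per class
print("accuracy:", ((Phi @ W).argmax(axis=1) == labels).mean())
```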
Probit regression
For a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables.
However, this is not the case for all choices of class-conditional density, so it may be worth exploring other types of discriminative probabilistic model.
Probit regression
Noisy threshold model: set t_n = 1 if a_n = w^T φ_n exceeds a threshold θ, where θ is drawn from a density p(θ).
Corresponding activation function: f(a) = ∫_{-∞}^{a} p(θ) dθ.
For p(θ) = N(θ|0, 1) this is the probit function Φ(a) = ∫_{-∞}^{a} N(θ|0, 1) dθ, which has a sigmoidal shape.
The generalized linear model based on a probit activation function is known as probit regression.
Canonical link functions
We have seen that for some models, the derivative of the error function w.r.t. the parameters w takes the form of the error times the feature vector:
the logistic regression model with a sigmoid activation function, and
the multiclass logistic regression model with a softmax activation function.
This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with the corresponding choice of activation function, known as the canonical link function.
Canonical link functions (Cont'd)
Conditional distribution of the target variable: p(t|η, s) = (1/s) h(t/s) g(η) exp{η t / s}.
Log likelihood: ln p(t|η, s) = Σ_{n=1}^N {ln g(η_n) + η_n t_n / s} + const.
The derivative of the log likelihood:
∇_w ln p(t|η, s) = (1/s) Σ_{n=1}^N {t_n - y_n} ψ'(y_n) f'(a_n) φ_n,
where a_n = w^T φ_n, y_n = f(a_n), and η = ψ(y) relates the natural parameter to the conditional mean y.
The canonical link function: choose f^{-1}(y) = ψ(y); then ψ'(y) f'(a) = 1 and
∇E(w) = (1/s) Σ_{n=1}^N (y_n - t_n) φ_n.
The Laplace approximation
We cannot integrate exactly over the parameter vector, since the posterior is no longer Gaussian.
The Laplace approximation: find a Gaussian approximation centered on the mode of the distribution.
For p(z) = f(z)/Z, Taylor expansion of the logarithm of the target function around the mode z_0:
ln f(z) ≈ ln f(z_0) - (1/2) A (z - z_0)^2, where A = -d²/dz² ln f(z) |_{z=z_0}.
Resulting approximate Gaussian distribution: q(z) = (A/2π)^{1/2} exp{-(A/2)(z - z_0)^2} = N(z | z_0, A^{-1}).
The Laplace approximation (Cont'd)
M-dimensional case:
ln f(z) ≈ ln f(z_0) - (1/2)(z - z_0)^T A (z - z_0), where A = -∇∇ ln f(z) |_{z=z_0}.
q(z) = (|A|^{1/2} / (2π)^{M/2}) exp{-(1/2)(z - z_0)^T A (z - z_0)} = N(z | z_0, A^{-1}).
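A minimal sketch of the one-dimensional Laplace approximation: locate the mode numerically, then take the variance as the inverse negative second derivative of ln f at the mode. The unnormalized target f(z) below is an arbitrary non-Gaussian example, and the grid search and finite differences are simple stand-ins for a proper optimizer:

```python
# Laplace approximation of a 1-D distribution p(z) ∝ f(z).
import numpy as np

def log_f(z):
    # Example unnormalized log density (a skewed, non-Gaussian target).
    return -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-4 * z)))

# 1) Mode of the distribution, found by a dense grid search.
z_grid = np.linspace(-5, 5, 200001)
z0 = z_grid[np.argmax(log_f(z_grid))]

# 2) Curvature A = -d^2/dz^2 ln f(z) at the mode, via central finite differences.
h = 1e-3
A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2

print("Laplace approximation: N(z | %.4f, %.4f)" % (z0, 1.0 / A))
```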
Model comparison and BIC
Laplace approximation to the normalization constant: Z = ∫ f(z) dz ≈ f(z_0) (2π)^{M/2} / |A|^{1/2}.
This result can be used to obtain an approximation to the model evidence, which plays a central role in Bayesian model comparison.
Consider a set of models {M_i} having parameters {θ_i}. The log of the model evidence can be approximated as
ln p(D) ≈ ln p(D|θ_MAP) + ln p(θ_MAP) + (M/2) ln 2π - (1/2) ln|A| (the last three terms form the Occam factor penalizing model complexity).
With some further assumptions, this yields the Bayesian Information Criterion (BIC):
ln p(D) ≈ ln p(D|θ_MAP) - (1/2) M ln N.
Bayesian Logistic Regression
Exact Bayesian inference is intractable.
Gaussian prior: p(w) = N(w | m_0, S_0).
Posterior: p(w|t) ∝ p(w) p(t|w).
Log of the posterior:
ln p(w|t) = -(1/2)(w - m_0)^T S_0^{-1}(w - m_0) + Σ_{n=1}^N {t_n ln y_n + (1 - t_n) ln(1 - y_n)} + const, where y_n = σ(w^T φ_n).
Laplace approximation of the posterior distribution: q(w) = N(w | w_MAP, S_N), where
S_N^{-1} = S_0^{-1} + Σ_{n=1}^N y_n(1 - y_n) φ_n φ_n^T.
Predictive distribution
Can be obtained by marginalizing with respect to the posterior distribution p(w|t), which is approximated by the Gaussian q(w):
p(C_1|φ, t) = ∫ p(C_1|φ, w) p(w|t) dw ≈ ∫ σ(w^T φ) q(w) dw = ∫ σ(a) p(a) da,
where a = w^T φ and p(a), as a marginal of a Gaussian, is also Gaussian:
p(a) = N(a | μ_a, σ_a²), with μ_a = w_MAP^T φ and σ_a² = φ^T S_N φ.
Predictive distribution
Resulting approximation to the predictive distribution: p(C_1|φ, t) ≈ ∫ σ(a) N(a | μ_a, σ_a²) da.
To integrate over a, we make use of the close similarity between the logistic sigmoid and the probit function: σ(a) ≈ Φ(λa) with λ² = π/8.
Then ∫ Φ(λa) N(a | μ, σ²) da = Φ(μ / (λ^{-2} + σ²)^{1/2}), so that
∫ σ(a) N(a | μ_a, σ_a²) da ≈ σ(κ(σ_a²) μ_a), where κ(σ²) = (1 + πσ²/8)^{-1/2}.
Finally we get p(C_1|φ, t) ≈ σ(κ(σ_a²) μ_a).
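A minimal sketch of this approximate predictive probability, assuming w_MAP and S_N have already been produced by the Laplace fit of the previous slide (the numerical values below merely stand in for such a fit):

```python
# Approximate Bayesian predictive probability: sigma(kappa(sigma_a^2) * mu_a).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictive_prob(phi, w_map, S_N):
    mu_a = w_map @ phi                                    # mean of a = w^T phi
    sigma2_a = phi @ S_N @ phi                            # variance of a under q(w)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)   # probit-based correction
    return sigmoid(kappa * mu_a)

# Illustrative values standing in for the output of the Laplace approximation.
w_map = np.array([0.5, 2.0, -1.0])
S_N = np.diag([0.3, 0.2, 0.4])
phi = np.array([1.0, 0.8, -0.5])

print("plug-in  p(C1|phi) =", sigmoid(w_map @ phi))
print("Bayesian p(C1|phi) =", predictive_prob(phi, w_map, S_N))
```

The posterior uncertainty moderates the prediction toward 0.5 relative to the plug-in MAP estimate, which is the practical effect of the κ(σ_a²) factor.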