LINEAR MODELS FOR CLASSIFICATION
Classification: Problem Statement

In regression, we model the relationship between a continuous input variable x and a continuous target variable t. In classification, the input variable x is still continuous, but the target variable t is discrete. In the simplest case, t can take only two values.
Example Problem

Animal or Vegetable?
Linear models for classification separate input vectors into classes using linear decision boundaries.

Example: input vector x, two discrete classes C_1 and C_2.

[Figure: two classes of points in the plane separated by a linear decision boundary.]
Discriminant Functions

A linear discriminant function

    y(x) = f(w^T x + w_0)

maps a real input vector x to a scalar value y(x). f(·) is called an activation function.
Outline

- Linear activation functions
  - Least-squares formulation
  - Fisher's linear discriminant
- Nonlinear activation functions
  - Probabilistic generative models
  - Probabilistic discriminative models
    - Logistic regression
    - Bayesian logistic regression
Two-Class Discriminant Function

Let f(·) be the identity:

    y(x) = w^T x + w_0

    y(x) ≥ 0  →  x assigned to C_1
    y(x) < 0  →  x assigned to C_2

Thus y(x) = 0 defines the decision boundary.

[Figure: geometry of the decision boundary, with regions R_1 (y > 0) and R_2 (y < 0); w is normal to the boundary and w_0 sets its offset from the origin.]
K > 2 Classes

Idea #1: Just use K−1 discriminant functions, each of which separates one class C_k from the rest. (One-versus-the-rest classifier.)

Problem: ambiguous regions.

[Figure: one-versus-the-rest boundaries for C_1 and C_2 leave an ambiguous region.]
K > 2 Classes

Idea #2: Use K(K−1)/2 discriminant functions, each of which separates two classes C_j, C_k from each other. (One-versus-one classifier.) Each point is classified by majority vote.

Problem: ambiguous regions.

[Figure: one-versus-one boundaries for C_1, C_2, C_3 still leave an ambiguous central region.]
K > 2 Classes

Idea #3: Use K discriminant functions y_k(x), and use the magnitude of y_k(x), not just the sign:

    y_k(x) = w_k^T x + w_k0

    x assigned to C_k if y_k(x) > y_j(x) for all j ≠ k

The decision boundary between classes k and j is y_k(x) = y_j(x), i.e.

    (w_k − w_j)^T x + (w_k0 − w_j0) = 0

This results in decision regions that are simply connected and convex.
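A minimal NumPy sketch of this decision rule follows. The weight matrix, biases, and input point are made-up values purely for illustration; the only substance is the argmax over the K linear discriminants.

```python
import numpy as np

# Assign x to the class C_k with the largest discriminant y_k(x) = w_k^T x + w_k0.
K, D = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(D, K))   # column k holds w_k (made-up values for illustration)
w0 = rng.normal(size=K)       # biases w_k0 (made-up values)

x = np.array([0.5, -1.0])     # an arbitrary input vector
y = W.T @ x + w0              # y_k(x) for k = 1, ..., K
k_hat = int(np.argmax(y))     # index of the assigned class
print(y, k_hat)
```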
Learning the Parameters

Method #1: Least Squares

    y_k(x) = w_k^T x + w_k0

In matrix form, y(x) = W^T x, where x = (1, x^T)^T now includes a leading 1, and W is a (D+1) × K matrix whose kth column is w_k = (w_k0, w_k^T)^T.
Learning the Parameters

Method #1: Least Squares

    y(x) = W^T x

Training dataset: (x_n, t_n), n = 1, ..., N, where we use the 1-of-K coding scheme for t_n.

Let T be the N × K matrix whose nth row is t_n^T.
Let X be the N × (D+1) matrix whose nth row is x_n^T.

We define the error as

    E_D(W) = (1/2) Tr{ (XW − T)^T (XW − T) }

Setting the derivative with respect to W to zero yields

    W = (X^T X)^{-1} X^T T = X^† T

where X^† is the pseudo-inverse of X.
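A short sketch of this least-squares fit, assuming X is an N × D data matrix and T an N × K target matrix in 1-of-K coding; the function names are hypothetical.

```python
import numpy as np

def fit_least_squares(X, T):
    """Least-squares weights W = pinv(X_tilde) T, where X_tilde prepends a
    column of ones to X and T is N x K in 1-of-K coding."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.pinv(X_tilde) @ T        # (D+1) x K weight matrix

def classify(W, X):
    """Assign each row of X to the class with the largest linear discriminant."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W, axis=1)
```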
Fisher's Linear Discriminant

Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.

Let the class means be

    m_1 = (1/N_1) Σ_{n ∈ C_1} x_n,    m_2 = (1/N_2) Σ_{n ∈ C_2} x_n

For example, we might choose w to maximize w^T (m_2 − m_1), subject to ||w|| = 1. This leads to w ∝ m_2 − m_1.

However, if the class-conditional distributions are not isotropic, this is typically not optimal.

[Figure: projecting onto m_2 − m_1 gives poor separation when the class covariances are elongated.]
Fisher's Linear Discriminant

Let m'_1 = w^T m_1 and m'_2 = w^T m_2 be the projected class means on the 1D subspace, and let

    s_k^2 = Σ_{n ∈ C_k} (y_n − m'_k)^2,    where y_n = w^T x_n,

be the within-class variance on the subspace for class C_k.

The Fisher criterion is then

    J(w) = (m'_2 − m'_1)^2 / (s_1^2 + s_2^2)

This can be rewritten as

    J(w) = (w^T S_B w) / (w^T S_W w)

where

    S_B = (m_2 − m_1)(m_2 − m_1)^T    is the between-class covariance, and

    S_W = Σ_{n ∈ C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n ∈ C_2} (x_n − m_2)(x_n − m_2)^T    is the within-class covariance.

J(w) is maximized for w ∝ S_W^{-1} (m_2 − m_1).
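A minimal sketch of this computation, assuming the two classes are given as row-matrices X1 and X2; the function name is hypothetical.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction w ∝ S_W^{-1} (m2 - m1) for two classes,
    given as N1 x D and N2 x D arrays of input vectors."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class covariance
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)   # normalize; only the direction matters
```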
Connection between Least Squares and Fisher's Linear Discriminant

Change the target coding scheme to

    t_n = N / N_1    for C_1
    t_n = −N / N_2   for C_2

Then one can show that the least-squares solution for w satisfies

    w ∝ S_W^{-1} (m_2 − m_1)
Least Squares Classifier

Problem #1: Sensitivity to outliers.

[Figure: two-class data with and without outlying points; the least-squares decision boundary shifts substantially when outliers are added.]
Least Squares Classifier

Problem #2: A linear activation function is not a good fit to binary data. This can lead to problems.

[Figure: a multi-class example where the least-squares solution classifies poorly.]
Outline

- Linear activation functions
  - Least-squares formulation
  - Fisher's linear discriminant
- Nonlinear activation functions
  - Probabilistic generative models
  - Probabilistic discriminative models
    - Logistic regression
    - Bayesian logistic regression
Probabilistic Generative Models

Consider first K = 2. By Bayes' rule, the posterior for class C_1 can be written

    p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) )
               = 1 / (1 + exp(−a)) = σ(a)

where

    a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]

and σ(a) is the logistic sigmoid function.

[Figure: the logistic sigmoid σ(a).]
Probabilistic Generative Models

Let's assume that the input vector x is multivariate normal when conditioned on the class C_k, and that the covariance Σ is the same for all classes:

    p(x | C_k) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp( −(1/2) (x − μ_k)^T Σ^{-1} (x − μ_k) )

Then we have that

    p(C_1 | x) = σ(w^T x + w_0)

where

    w   = Σ^{-1} (μ_1 − μ_2)
    w_0 = −(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln( p(C_1) / p(C_2) )

Thus we have a generalized linear model, and the decision surfaces will be hyperplanes in the input space.
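A minimal sketch of this posterior, assuming the class means, shared covariance, and priors are already known; the function name is hypothetical.

```python
import numpy as np

def posterior_c1(x, mu1, mu2, Sigma, prior1, prior2):
    """p(C1 | x) = sigma(w^T x + w0) for shared-covariance Gaussian class-conditionals."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    a = w @ x + w0
    return 1.0 / (1.0 + np.exp(-a))   # logistic sigmoid of the linear function
```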
Probabilistic Generative Models

This result generalizes to K > 2 classes:

    p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j)
               = exp(a_k) / Σ_j exp(a_j)      (the softmax function)

where

    a_k = ln( p(x | C_k) p(C_k) )

With shared-covariance Gaussian class-conditionals, we then have

    a_k(x) = w_k^T x + w_k0

where

    w_k  = Σ^{-1} μ_k
    w_k0 = −(1/2) μ_k^T Σ^{-1} μ_k + ln p(C_k)
Non-Constant Covariance

If the class-conditional covariances are different, the generative decision boundaries are in general quadratic.

[Figure: Gaussian classes with differing covariances, giving a mix of linear and quadratic decision boundaries.]
ML for Probabilistic Generative Model

Let t_n = 1 denote Class 1 and t_n = 0 denote Class 2, and let π = p(C_1), so that 1 − π = p(C_2).

Then the ML estimates for the parameters are

    π   = N_1 / (N_1 + N_2)

    μ_1 = (1/N_1) Σ_{n=1}^N t_n x_n
    μ_2 = (1/N_2) Σ_{n=1}^N (1 − t_n) x_n

    Σ   = (N_1/N) S_1 + (N_2/N) S_2

where

    S_1 = (1/N_1) Σ_{n ∈ C_1} (x_n − μ_1)(x_n − μ_1)^T
    S_2 = (1/N_2) Σ_{n ∈ C_2} (x_n − μ_2)(x_n − μ_2)^T
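A sketch of these ML estimates, assuming X is an N × D data matrix and t a length-N array of binary labels; the function name is hypothetical.

```python
import numpy as np

def fit_shared_covariance_gaussians(X, t):
    """ML estimates pi, mu1, mu2, Sigma for the shared-covariance Gaussian model.
    t_n = 1 denotes Class 1, t_n = 0 denotes Class 2."""
    t = np.asarray(t, dtype=float)
    N1, N2 = t.sum(), (1.0 - t).sum()
    N = N1 + N2
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2
    X1, X2 = X[t == 1.0], X[t == 0.0]
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1       # per-class covariances
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2     # weighted average of the two
    return pi, mu1, mu2, Sigma
```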
Probabilistic Discriminative Models

An alternative to the generative approach is to model the dependence of the target variable t on the input vector x directly, using the activation function f. One big advantage is that there will typically be fewer parameters to determine.
Logistic Regression (K = 2)

    p(C_1 | φ) = y(φ) = σ(w^T φ)
    p(C_2 | φ) = 1 − p(C_1 | φ)

where

    σ(a) = 1 / (1 + exp(−a))
Logistic Regression

    p(C_1 | φ) = y(φ) = σ(w^T φ)
    p(C_2 | φ) = 1 − p(C_1 | φ)

where σ(a) = 1 / (1 + exp(−a)).

Number of parameters (for M-dimensional feature vectors φ):
- Logistic regression: M
- Generative model: 2M + M(M+1)/2 + 1 = M(M+5)/2 + 1
ML for Logistic Regression

The likelihood is

    p(t | w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1 − t_n}

where t = (t_1, ..., t_N)^T and y_n = p(C_1 | φ_n).

We define the error function to be the negative log likelihood, E(w) = −ln p(t | w). Given y_n = σ(a_n) and a_n = w^T φ_n, one can show that

    ∇E(w) = Σ_{n=1}^N (y_n − t_n) φ_n

Unfortunately, there is no closed-form solution for w.
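A minimal sketch of this error and gradient, assuming Phi is the N × M design matrix of features φ_n and t the binary targets; the function names are hypothetical, and no guarding against log(0) is included.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_and_gradient(w, Phi, t):
    """Cross-entropy error E(w) = -ln p(t|w) and its gradient Phi^T (y - t)."""
    y = sigmoid(Phi @ w)
    E = -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
    grad = Phi.T @ (y - t)        # sum_n (y_n - t_n) phi_n
    return E, grad
```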
ML for Logistic Regression: Iterative Reweighted Least Squares

Although there is no closed-form solution for the ML estimate of w, fortunately the error function is convex. Thus an appropriate iterative method is guaranteed to converge to the global minimum.

A good method is to use a local quadratic approximation to the log likelihood function (Newton-Raphson update):

    w^(new) = w^(old) − H^{-1} ∇E(w)

where H is the Hessian matrix of E(w).
ML for Logistic Regression

    w^(new) = w^(old) − H^{-1} ∇E(w)

where H is the Hessian matrix of E(w):

    H = Φ^T R Φ

where R is the N × N diagonal weight matrix with

    R_nn = y_n (1 − y_n)

(Note that, since R_nn ≥ 0, R is positive semi-definite, and hence H is positive semi-definite. Thus E(w) is convex.)

Thus

    w^(new) = w^(old) − (Φ^T R Φ)^{-1} Φ^T (y − t)
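A sketch of the IRLS update, assuming Phi and t as before. The function name, the fixed iteration count (rather than a convergence test), and the tiny jitter added for numerical stability are my own additions, not part of the slides.

```python
import numpy as np

def irls(Phi, t, n_iter=20):
    """Newton-Raphson / IRLS: w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = y * (1.0 - y)                       # diagonal entries of R
        H = Phi.T @ (R[:, None] * Phi)          # Hessian Phi^T R Phi
        H += 1e-10 * np.eye(H.shape[0])         # tiny jitter for numerical stability (assumption)
        w = w - np.linalg.solve(H, Phi.T @ (y - t))
    return w
```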
ML for Logistic Regression: Iterative Reweighted Least Squares

    p(C_1 | φ) = y(φ) = σ(w^T φ)

[Figure: illustration of σ(w^T φ) over the weight space (w_1, w_2).]
Bayesian Logistic Regression

We can make logistic regression Bayesian by applying a prior over w:

    p(w) = N(w | m_0, S_0)

Unfortunately, the posterior over w will not be normal for logistic regression, and hence we cannot integrate over it analytically. This means that we cannot do Bayesian prediction analytically. However, there are methods for approximating the posterior that allow us to do approximate Bayesian prediction.
The Laplace Approximation

In the Laplace approximation, we approximate the log of a distribution by a local, second-order (quadratic) form, centred at the mode. This corresponds to a normal approximation to the distribution, with
- mean given by the mode of the original distribution
- precision matrix given by the Hessian of the negative log of the distribution

[Figure: a non-Gaussian p(z) and its Gaussian Laplace approximation (left); the corresponding log-density curves (right).]
Bayesian Logistic Regression

When applied to the posterior over w in logistic regression, the Laplace approximation yields

    p(w | t) ≈ q(w) = N(w | w_MAP, S_N)

where

    S_N^{-1} = S_0^{-1} + Σ_{n=1}^N y_n (1 − y_n) φ_n φ_n^T
Prediction

Bayesian prediction requires that we integrate out this posterior over w:

    p(C_1 | φ, t) = ∫ p(C_1 | φ, w) p(w | t) dw ≈ ∫ σ(w^T φ) q(w) dw

This integral is not tractable analytically. However, approximating the sigmoid function σ(·) by the inverse probit (cumulative normal) function yields an analytical result:

    p(C_1 | φ, t) ≈ σ( κ(σ_a^2) μ_a )

where

    μ_a = w_MAP^T φ,    σ_a^2 = φ^T S_N φ,    κ(σ_a^2) = (1 + π σ_a^2 / 8)^{-1/2}
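A minimal sketch of this approximate predictive probability, assuming w_MAP and S_N have already been obtained from the Laplace approximation; the function name is hypothetical.

```python
import numpy as np

def predictive_prob(phi, w_map, S_N):
    """Approximate p(C1 | phi, t) ≈ sigma(kappa(sigma_a^2) * mu_a),
    with mu_a = w_MAP^T phi and sigma_a^2 = phi^T S_N phi."""
    mu_a = w_map @ phi
    sigma_a2 = phi @ S_N @ phi
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma_a2 / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))   # logistic sigmoid of the moderated activation
```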
Bayesian Logistic Regression 35 $!"! ( ) σ w t φ q(w)dw σ ( κ ( 2 σ ) a µ ) a Residual '( )( *( $ $!"!&!"!#!!! %& & %& & %& & This last approximation is excellent!