Catherine Lee Anderson
(figures courtesy of Christopher M. Bishop)
Department of Computer Science, University of Nebraska at Lincoln
CSCE 970: Pattern Recognition and Machine Learning
Congratulations! You have just inherited an old house from your great-grand-aunt on your mother's side, twice removed by marriage (and once by divorce). There is an amazing collection of books in the old library (and in almost every other room in the house), containing everything from old leather-bound tomes and crinkly old parchments to newer dust-jacketed best sellers and academic textbooks, along with a sizable collection of paperback pulp fiction and DC comic books. Yep, old Aunt Lacee was a collector, and you have to clean out the old house before you can sell it.
Your Mission: Being the overworked (and underpaid) student that you are, you have limited time to spend on this task... but because you spend your leisure time listening to NPR (The Book Guys), you know that there is money to be made in old books. In other words, you need to quickly determine which books to throw out (or better still, recycle), which to sell, and which to hang onto as an investment.
The Task: From listening to The Book Guys you know that there are many aspects of a book that determine its present value, which will help you decide whether to toss, sell, or keep it. These aspects include: date published, author, topic, condition, genre, presence of a dust jacket, number of volumes known to have been published, etc. And to your advantage, you have just completed a course in machine learning, so you recognize that what you have is a straightforward classification problem.
Outline
1. Defining the problem; Approaches in modeling
2. Discriminant Functions
3. Logistic Regression
4. Modeling conditional class probabilities; Bayes Theorem; Discrete Features
Classification. Problem Components:
- A group, X, of items x with common characteristics, with specific values assigned to these characteristics; values can be nominal or numeric, discrete or continuous.
- A set of disjoint classes, into which we wish to place each of the above items.
- A function that assigns each item to one and only one of these disjoint classes.
Classification: assigning each item to one discrete class using a function devised specifically for this purpose.
Structure of items. Each item can be represented as a D-dimensional vector, x = {x_1, x_2, ..., x_D}, where D is the number of aspects, attributes, or value fields used to describe the item.
Aunt Lacee's Collection: items to be classified are books, comics, and parchments, each of which has a set of values attached to it (type, title, publish date, genre, condition, ...).
Sample items from Aunt Lacee's collection:
x = {book, Origin of Species, 1872, biology, mint, ...}
x = {parchment, Magna Carta, 1210, history, brittle, ...}
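As a minimal sketch, an item could be held as a record and flattened into an ordered vector; the attribute names and values below are hypothetical, in the spirit of the examples above.

```python
# Represent one of Aunt Lacee's items and map it to a fixed D-dimensional
# layout. FIELDS fixes the attribute order so every item yields the same
# vector shape (names here are illustrative assumptions).
item = {"type": "book", "title": "Origin of Species", "year": 1872,
        "topic": "biology", "condition": "mint"}

FIELDS = ["type", "title", "year", "topic", "condition"]

def to_vector(item):
    """Return the item as an ordered D-dimensional tuple x = (x_1, ..., x_D)."""
    return tuple(item[f] for f in FIELDS)

x = to_vector(item)
```

Here D = 5; real use would encode the nominal values numerically before feeding them to a linear model.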
Structure of Classes. A set of K classes, C = {c_1, c_2, ..., c_K}, where each x can belong to only one class. The input space is divided into K decision regions, each region corresponding to a class. The boundaries of decision regions are called decision boundaries or decision surfaces. In linear classification models these surfaces are linear functions of x; in other words, they are defined by (D-1)-dimensional hyperplanes within the D-dimensional input space.
Example: two dimensions, two classes. [Figure: a linear decision boundary y = 0 in (x_1, x_2) space separating regions R_1 (y > 0) and R_2 (y < 0); the weight vector w is orthogonal to the boundary, and the bias w_0 sets its offset from the origin.]
Structure of Classes. For Aunt Lacee's book collection, K = 3:
c_1 = no value: books with no value, which will be recycled.
c_2 = sell immediately: books with immediate cash value, such as current textbooks and best sellers, which will be sold quickly.
c_3 = keep: these books (or parchments or comics) have museum-quality price tags and require time to place properly (for maximum profit).
Each item of the collection will be assigned one and only one class; by their very nature, the classes are mutually exclusive.
Representation of a K-Class Label. Let t be a vector of length K used to represent a class label. Each element t_k of t is 0, except element i, which is 1 when x ∈ c_i. For Aunt Lacee's collection, the values of t are as follows:
t_i = {1, 0, 0} indicates x_i ∈ c_1 and should be recycled.
t_i = {0, 1, 0} indicates x_i ∈ c_2 and should be sold.
t_i = {0, 0, 1} indicates x_i ∈ c_3 and should be kept.
A binary class is a special case, needing only a single-dimension vector: t = {0} indicates x_i ∈ c_0, and t = {1} indicates x_i ∈ c_1.
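The 1-of-K coding above can be sketched directly:

```python
import numpy as np

K = 3  # Aunt Lacee's classes: recycle, sell, keep

def one_hot(i, K=K):
    """1-of-K target vector t: all zeros except element i (0-indexed)."""
    t = np.zeros(K, dtype=int)
    t[i] = 1
    return t

t_sell = one_hot(1)   # x belongs to c_2 ("sell immediately")
```

Each target vector sums to 1, reflecting that every item belongs to exactly one class.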
Approaches to the problem. Three approaches to finding the function for our classification problem:
Discriminant Functions - The simplest approach: a function which directly assigns each x to one c_i ∈ C.
Probabilistic - Separates the inference stage from the decision stage. In the inference stage, the conditional probability distribution p(c_k|x) is modeled directly; in the decision stage, the class is assigned based on these distributions.
Approaches to the problem. Probabilistic Generative Functions - Both the class-conditional probability distribution, p(x|c_k), and the prior probabilities, p(c_k), are modeled and used to compute posterior probabilities via Bayes' theorem:

p(c_k|x) = p(x|c_k) p(c_k) / p(x)

This model develops the probability densities of the input space such that new examples can accurately be generated.
Two-class problem:

y(x) = w^T x + w_0

where w is a weight vector of the same dimension D as x, and w_0 is the bias (its negative, -w_0, is the threshold). An input vector x is assigned to one of the two classes as follows:

x ∈ c_0 if y(x) < 0; x ∈ c_1 if y(x) ≥ 0

The decision boundary will be a hyperplane of D - 1 dimensions.
Matrix notation. As a reminder of the convention, vectors are column matrices:

w = (w_1, w_2, ..., w_D)^T so w^T = [w_1 w_2 ... w_D]

and

w^T x = w_1 x_1 + w_2 x_2 + ... + w_D x_D
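The two-class rule above is a one-line computation; the weights below are illustrative values, not learned ones.

```python
import numpy as np

def classify_two_class(x, w, w0):
    """Assign x to c_1 if y(x) = w^T x + w_0 >= 0, else to c_0."""
    y = w @ x + w0
    return 1 if y >= 0 else 0

w = np.array([1.0, -1.0])   # example weight vector (assumed values)
w0 = 0.5                    # example bias
label = classify_two_class(np.array([2.0, 1.0]), w, w0)  # y = 1.5, so c_1
```

The decision boundary here is the line x_1 - x_2 + 0.5 = 0 in the two-dimensional input space.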
Example: two dimensions, two classes. [Figure: decision boundary y = 0 separating regions R_1 (y > 0) and R_2 (y < 0), with weight vector w orthogonal to the boundary.]

y(x) = w^T x + w_0
Multi-class (K > 2). A K-class discriminant comprised of K functions of the form

y_k(x) = w_k^T x + w_k0

Assign the input vector as follows:

x ∈ c_k where k = argmax_{j ∈ {1,2,...,K}} y_j(x)

[Figure: decision regions R_i, R_j, R_k; a point x̂ on the line between x_A and x_B, both in R_k, also lies in R_k.]
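The argmax assignment can be sketched with one matrix product; the weight values are again illustrative assumptions.

```python
import numpy as np

def classify_multiclass(x, W, w0):
    """k = argmax_k y_k(x), with y_k(x) = w_k^T x + w_k0.
    W holds one weight vector per class (rows); w0 holds the K biases."""
    y = W @ x + w0
    return int(np.argmax(y))

W = np.array([[1.0, 0.0],    # w_1 (example values)
              [0.0, 1.0],    # w_2
              [-1.0, -1.0]]) # w_3
w0 = np.zeros(3)
k = classify_multiclass(np.array([0.2, 0.9]), W, w0)  # y = (0.2, 0.9, -1.1)
```

Because the class with the largest y_k wins, the resulting decision regions are singly connected and convex.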
Learning parameter w. Three techniques for learning the parameter of the discriminant function, w: Least Squares; Fisher's linear discriminant; the Perceptron.
Least Squares. Once again, each class c_k has its own linear model: y_k(x) = w_k^T x + w_k0. As a reminder of the convention, vectors are column matrices: w = (w_1, w_2, ..., w_D)^T, so w^T = [w_1 w_2 ... w_D].
Least Squares, compact notation. Let W̃ be a (D+1) × K matrix whose k-th column is the augmented weight vector w̃_k = (w_k0, w_k1, ..., w_kD)^T. Let x̃ be a (D+1) × 1 column matrix x̃ = (1, x^T)^T = (1, x_1, ..., x_D)^T.
Least Squares, compact notation. The individual class discriminant functions y_k(x) = w_k^T x + w_k0 can be written together as

y(x) = W̃^T x̃
Least Squares, determining W̃. W̃ is determined by minimizing a sum-of-squares error function whose form is given as:

E(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}^2

Let X̃ be an N × (D+1) matrix representing a training set of N examples. Let T be an N × K matrix representing the targets for the N training examples.
Least Squares, determining W̃. This yields the expression

E_D(W̃) = (1/2) Tr{ (X̃W̃ - T)^T (X̃W̃ - T) }

To minimize, take the derivative with respect to W̃ and set it to zero to obtain

W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃^† T

where X̃^† is the pseudo-inverse of X̃, and finally the discriminant function

y(x) = W̃^T x̃ = T^T (X̃^†)^T x̃
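The pseudo-inverse solution above is a few lines of numpy; the toy training set is an illustrative assumption, not from the lecture.

```python
import numpy as np

# Least-squares classification: augment inputs with a leading 1 (the bias),
# solve W~ = (X~^T X~)^-1 X~^T T via the pseudo-inverse, predict by argmax.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])      # 1-of-K targets, K = 2

X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # N x (D+1)
W_tilde = np.linalg.pinv(X_tilde) @ T               # (D+1) x K

def predict(x):
    """Class index k = argmax_k y_k(x) for the least-squares model."""
    x_tilde = np.concatenate([[1.0], x])
    return int(np.argmax(W_tilde.T @ x_tilde))
```

On this separable toy set the fit is exact; on real data the lack of a (0, 1) constraint and the sensitivity to outliers noted above still apply.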
Least Squares, considerations. Under certain conditions, this model has the property that the elements of y(x) sum to 1 for any value of x. However, the elements are not constrained to lie in the interval (0, 1), meaning that negative values and values larger than 1 can occur, so they cannot be treated as probabilities. Among other disadvantages, this approach responds inappropriately to outliers.
Least Squares, response to outliers. [Figure: least-squares decision boundaries on two datasets. a) Classes well separated; b) in the presence of outliers, the boundary shifts and several examples are misclassified.]
Fisher's Linear Discriminant in concept. An approach that reduces the dimensionality of the model by projecting the input vector to a reduced-dimension space. [Figure: a simple example, two-dimensional input vectors projected down to one dimension.]
Fisher's Linear Discriminant. Start with a two-class problem: y = w^T x, whose class mean vectors are given as

m_1 = (1/N_1) Σ_{n ∈ C_1} x_n and m_2 = (1/N_2) Σ_{n ∈ C_2} x_n

Choose w to maximize the separation of the projected means,

m_2 - m_1 = w^T (m_2 - m_1)
Fisher's Linear Discriminant. Maximizing the separation of the projected class means. [Figure: projection onto the line joining the class means; the classes still overlap.]
Fisher's Linear Discriminant. Add the condition of minimizing the within-class variance, which is given as

s_k^2 = Σ_{n ∈ C_k} (y_n - m_k)^2

Fisher's criterion is based on maximizing the separation of the class means while minimizing the within-class variance. These two conditions are captured in the ratio between the variance of the class means and the within-class variance, given by

J(w) = (m_2 - m_1)^2 / (s_1^2 + s_2^2)
Fisher's Linear Discriminant. Casting this ratio back into terms of the original frame of reference,

J(w) = (w^T S_B w) / (w^T S_W w)

where

S_B = (m_2 - m_1)(m_2 - m_1)^T

and

S_W = Σ_{n ∈ C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n ∈ C_2} (x_n - m_2)(x_n - m_2)^T

Take the derivative with respect to w and set it to zero to find the maximum.
Fisher's Linear Discriminant. Setting the derivative to zero gives

(w^T S_B w) S_W w = (w^T S_W w) S_B w

Only the direction of w is important:

w ∝ S_W^{-1} (m_2 - m_1)

To make this a discriminant function, a threshold y_0 is chosen so that

x ∈ C_1 if y(x) ≥ y_0; x ∈ C_2 if y(x) < y_0
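The Fisher direction w ∝ S_W^{-1}(m_2 - m_1) can be computed directly; the two Gaussian clusters below are illustrative data, not from the lecture.

```python
import numpy as np

# Fisher's linear discriminant on a 2-D toy problem: estimate class means,
# build the within-class scatter S_W, and solve for the projection direction.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 0.3, size=(50, 2))   # class C_1 samples
X2 = rng.normal([2.0, 1.0], 0.3, size=(50, 2))   # class C_2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m2 - m1)                # direction only matters

# Projected mean separation along w is positive: C_2 projects above C_1.
sep = w @ (m2 - m1)
```

A threshold y_0 between the projected class means would complete the discriminant.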
Fisher's Linear Discriminant. Second consideration: minimize the within-class variance. [Figure: the two classes nicely separated in the dimensionally reduced space.]
The Perceptron. This model takes the form

y(x) = f(w^T Φ(x))

where Φ is a transformation function that creates the feature vector from the input vector; we will use the identity transformation function to begin our discussion. The activation f(·) is given by

f(a) = +1 if a ≥ 0; -1 if a < 0
The Perceptron: the binary problem. There is a change in target coding: t is now a scalar, taking values of either +1 or -1. This value is interpreted as the input vector belonging to C_1 if t = +1, or C_2 if t = -1. In considering w, we want

x_n ∈ C_1 ⇒ w^T Φ(x_n) > 0 and x_n ∈ C_2 ⇒ w^T Φ(x_n) < 0

which means we want

∀ x_n ∈ X, w^T Φ(x_n) t_n > 0
The Perceptron: weight update. The perceptron error is

E_P(w) = -Σ_{n ∈ M} w^T Φ_n t_n

where M is the set of misclassified examples. The perceptron update rule, applied when x_n is misclassified:

w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η Φ_n t_n
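The update rule above drives a simple training loop; the separable toy data (with the identity Φ and an appended bias component) is an illustrative assumption.

```python
import numpy as np

# Perceptron training on a linearly separable toy set using +1/-1 targets.
# Each row of X is Phi(x_n) = (1, x_1, x_2): a bias component plus inputs.
X = np.array([[1.0, 2.0, 1.0], [1.0, 1.0, 2.0],       # class C_1 (t = +1)
              [1.0, -2.0, -1.0], [1.0, -1.0, -2.0]])  # class C_2 (t = -1)
t = np.array([1, 1, -1, -1])

w = np.zeros(3)
eta = 1.0
for _ in range(100):                   # epoch cap; converges much sooner
    errors = 0
    for phi, tn in zip(X, t):
        if (w @ phi) * tn <= 0:        # misclassified (or on the boundary)
            w = w + eta * phi * tn     # w <- w + eta * Phi_n * t_n
            errors += 1
    if errors == 0:                    # every example satisfies w^T Phi t > 0
        break
```

Consistent with the convergence theorem, the loop halts with all examples correctly classified because this data is separable.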
The Perceptron: example. [Figure: a) a misclassified example; b) the weight vector w after the update.]
The Perceptron: example. [Figure: a) the next misclassified example; b) the weight vector w after the update.]
The Perceptron: considerations. The update rule is guaranteed to reduce the error contribution from that specific example. It does not guarantee a reduction in the error contributions from the other misclassified examples, and it could change a previously correctly classified example to a misclassified one. However, the perceptron convergence theorem guarantees that an exact solution will be found if one exists, and that it will be found in a finite number of steps.
Review. We have seen three techniques for learning the parameters of a discriminant function: least squares, Fisher's linear discriminant, and the perceptron.
Logistic Regression. A Logit: what is it when it's at home? A logit is simply the natural log of the odds. Odds are simply the ratio of two probabilities. In a binary classification problem, the two posterior probabilities sum to 1: if p(c_1|x) is the probability that x belongs to c_1, then p(c_2|x) = 1 - p(c_1|x). So the odds are

odds = p(c_1|x) / (1 - p(c_1|x))
A Logit: what benefits? Example: if an individual is 6 feet tall, then according to census data the probability that the individual is male is 0.9. This makes the probability of being female 1 - 0.9 = 0.1. The odds on being male are 0.9/0.1 = 9; however, the odds on being female are 0.1/0.9 ≈ 0.11. The lack of symmetry is unappealing; intuition would appreciate the odds on being female being the opposite of the odds on being male.
A Logit: linear model. The natural log supplies this symmetry:

ln(9) = 2.197 and ln(1/9) = -2.197

Now, if we assume that the logit is linear with respect to x, we have

logit(P) = ln(P / (1 - P)) = a + Bx

where a and B are parameters.
From logit to sigmoid. From

logit(P) = ln(P / (1 - P)) = a + Bx

exponentiate both sides:

P / (1 - P) = e^(a+Bx)
P = (1 - P) e^(a+Bx) = e^(a+Bx) - P e^(a+Bx)
P + P e^(a+Bx) = e^(a+Bx)
P (1 + e^(a+Bx)) = e^(a+Bx)
P = e^(a+Bx) / (1 + e^(a+Bx)) = 1 / (1 + e^-(a+Bx))

where a sets the log-odds when x is zero and B adjusts the rate at which the probability changes with x.
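The derivation can be checked numerically: the sigmoid inverts the logit, and the census example's log-odds are symmetric.

```python
import math

def logit(p):
    """Natural log of the odds, ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def sigmoid(a):
    """Inverse of the logit: 1 / (1 + e^{-a})."""
    return 1.0 / (1.0 + math.exp(-a))

s = logit(0.9)      # log-odds of "male given 6 feet": ln(9) = 2.197...
p = sigmoid(s)      # round trip through the sigmoid recovers 0.9
```

Note logit(0.1) = -logit(0.9), the symmetry the raw odds lacked.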
The sigmoid. Sigmoid means S-shaped. It is also called a squashing function because it maps a very large domain into the relatively small interval (0, 1). [Figure: the sigmoid curve rising from 0 to 1, passing through 0.5 at a = 0.]
The model. The posterior probability of C_1 can be written

p(C_1|Φ) = y(Φ) = σ(w^T Φ) = 1 / (1 + e^-(w^T Φ))

w must be learned by adjusting its M components (the feature vector has length M). Weight update:

w^(τ+1) = w^(τ) - η ∇E_n

where

∇E(w) = Σ_{n=1}^{N} (y_n - t_n) Φ_n and ∇E_n = (y_n - t_n) Φ_n
Maximum Likelihood. Maximum likelihood concerns the probability p(t|w), read as the probability of the observed data set given the parameter vector w. This can be calculated by taking the product of the individual probabilities of the class assigned to each x_n agreeing with t_n:

p(t|w) = Π_{n=1}^{N} p(c_n = t_n|x), where t_n ∈ {0, 1}

and

p(c_n = t_n|x) = p(c_1|Φ_n) if t_n = 1; 1 - p(c_1|Φ_n) if t_n = 0
Maximum Likelihood. Since the target is either 1 or 0, this allows a mathematically convenient expression for this product:

p(t|w) = Π_{n=1}^{N} (p(c_1|Φ_n))^{t_n} (1 - p(c_1|Φ_n))^{1-t_n}

From p(c_1|Φ) = y(Φ) = σ(w^T Φ), writing y_n = y(Φ_n),

p(t|w) = Π_{n=1}^{N} (y_n)^{t_n} (1 - y_n)^{1-t_n}
Maximum Likelihood and error. The negative log of the likelihood function is

E(w) = -ln p(t|w) = -Σ_{n=1}^{N} (t_n ln y_n + (1 - t_n) ln(1 - y_n))

The gradient of this is

∇E(w) = d/dw (-ln p(t|w))
Maximum Likelihood and error.

∇E(w) = -Σ_{n=1}^{N} d/dw (t_n ln y_n + (1 - t_n) ln(1 - y_n))
= -Σ_{n=1}^{N} [ (t_n / y_n) (dy_n/dw) - ((1 - t_n) / (1 - y_n)) (dy_n/dw) ]

Using dy_n/dw = y_n (1 - y_n) Φ_n,

= -Σ_{n=1}^{N} ( (t_n / y_n) y_n (1 - y_n) Φ_n - ((1 - t_n) / (1 - y_n)) y_n (1 - y_n) Φ_n )
= -Σ_{n=1}^{N} (t_n - t_n y_n - y_n + t_n y_n) Φ_n = Σ_{n=1}^{N} (y_n - t_n) Φ_n
Logistic regression model. The model based on maximum likelihood:

p(C_1|Φ) = y(Φ) = σ(w^T Φ) = 1 / (1 + e^-(w^T Φ))

Weight update based on the gradient of the negative log likelihood:

w^(τ+1) = w^(τ) - η ∇E_n = w^(τ) - η (y_n - t_n) Φ_n
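The model and update rule together give a short stochastic gradient descent loop; the toy data and learning rate are illustrative assumptions.

```python
import numpy as np

# Stochastic gradient descent for logistic regression, using the identity
# Phi with a bias component: each row of X is Phi(x_n) = (1, x_n).
X = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -2.0], [1.0, -1.5]])
t = np.array([1, 1, 0, 0])            # 0/1 target coding

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
eta = 0.5
for _ in range(200):                   # epochs over the training set
    for phi, tn in zip(X, t):
        y = sigmoid(w @ phi)           # y_n = sigma(w^T Phi_n)
        w = w - eta * (y - tn) * phi   # w <- w - eta * (y_n - t_n) * Phi_n
```

After training, σ(w^T Φ) exceeds 0.5 exactly for the t = 1 examples.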
A new model. The model based on iteratively reweighted least squares:

p(C_1|Φ) = y(Φ) = σ(w^T Φ) = 1 / (1 + e^-(w^T Φ))

Weight update based on a Newton-Raphson iterative optimization scheme:

w_new = w_old - H^{-1} ∇E(w)

The Hessian H is a matrix whose elements are the second derivatives of E(w) with respect to w. This is a numerical-analysis technique that is an alternative to the first one covered; the trade-off is faster convergence at the cost of more computationally expensive steps.
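A sketch of the Newton-Raphson update, assuming the standard IRLS Hessian H = Φ^T R Φ with R = diag(y_n(1 - y_n)); the toy data, the small ridge term, and the clipping inside the sigmoid are our additions for numerical safety, not part of the lecture.

```python
import numpy as np

# Newton-Raphson (IRLS) for logistic regression on illustrative data.
Phi = np.array([[1.0, 2.0], [1.0, 1.5], [1.0, -2.0], [1.0, -1.5]])
t = np.array([1.0, 1.0, 0.0, 0.0])

def sigmoid(a):
    # clip is a numerical safeguard (our addition) against exp overflow
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

w = np.zeros(2)
for _ in range(10):                       # few steps: Newton converges fast
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t)                # gradient of E(w)
    R = np.diag(y * (1.0 - y))            # weighting matrix
    H = Phi.T @ R @ Phi + 1e-6 * np.eye(2)  # Hessian + tiny ridge (ours)
    w = w - np.linalg.solve(H, grad)      # w_new = w_old - H^{-1} grad
```

Each step re-solves a weighted least-squares problem, hence the name; far fewer iterations are needed than with first-order gradient descent.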
Probabilistic Generative Models: the approach. This approach tends to be more computationally expensive. The training data, and any information on the distribution of the training data within the input space, is used to model the class-conditional probabilities. Then, using Bayes' theorem, the posterior probability is calculated. The decision of label is made by choosing the maximum posterior probability.
Modeling class-conditional probabilities with prior probabilities. The class-conditional probability is given by p(x|c_k), read as the probability of x given the class c_k. The prior probability p(c_k) is the probability of c_k independent of any other variable. The joint probability is then

p(x_n, c_1) = p(c_1) p(x_n|c_1)
Specific case of a binary label. Let t_n = 1 denote c_1 and t_n = 0 denote c_2. Let p(c_1) = π, so p(c_2) = 1 - π. Let each class have a Gaussian class-conditional density with a shared covariance matrix:

N(x|μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{ -(1/2) (x - μ)^T Σ^{-1} (x - μ) }

where μ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| is the determinant of Σ.
Specific case of a binary label. The joint probabilities for each class are

p(c_1) p(x_n|c_1) = π N(x_n|μ_1, Σ)
p(c_2) p(x_n|c_2) = (1 - π) N(x_n|μ_2, Σ)

The likelihood function is given by

p(t|π, μ_1, μ_2, Σ) = Π_{n=1}^{N} [π N(x_n|μ_1, Σ)]^{t_n} [(1 - π) N(x_n|μ_2, Σ)]^{1-t_n}
Specific case of a binary label. The terms of the negative log likelihood that depend on π are

-Σ_{n=1}^{N} (t_n ln π + (1 - t_n) ln(1 - π))

We minimize this by setting the derivative with respect to π to zero and solving for π:

π = (1/N) Σ_{n=1}^{N} t_n = N_1 / N = N_1 / (N_1 + N_2)
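The maximum-likelihood estimates for this generative model (π, the class means, and the shared covariance as the count-weighted average of the per-class covariances) can be sketched on illustrative data:

```python
import numpy as np

# ML estimates for the binary Gaussian generative model with shared Sigma.
rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 0.5, size=(30, 2))   # class c_1 (t = 1)
X2 = rng.normal([3.0, 3.0], 0.5, size=(60, 2))   # class c_2 (t = 0)

N1, N2 = len(X1), len(X2)
pi = N1 / (N1 + N2)                  # pi = N_1 / N
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Shared covariance: weighted average of the per-class sample covariances.
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)
```

With π, μ_1, μ_2, and Σ in hand, posteriors follow from Bayes' theorem as in the next slides.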
Review of Bayes' Theorem.

P(c_k|x) = P(x|c_k) P(c_k) / P(x)

P(x) is the prior probability that x will be observed, meaning the probability of x given no knowledge about which c_k is observed. It can be seen that as P(x) increases, P(c_k|x) decreases, indicating that the higher the probability of an incident independent of any other factor, the lower the probability of that incident dependent on another condition.
Review of Bayes' Theorem. P(x|c_k) is the class-conditional probability that x will be observed once class c_k is observed. Both P(x|c_k) and P(c_k) have been modeled, so P(c_k|x), the posterior probability, can now be calculated. The label is assigned as the class that generates the Maximum A Posteriori (MAP) probability for the input vector:

c_MAP = argmax_{c_k ∈ C} P(c_k|x) = argmax_{c_k ∈ C} P(x|c_k) P(c_k) / P(x) = argmax_{c_k ∈ C} P(x|c_k) P(c_k)
Discrete feature values. Each x is made up of an ordered set of feature values: x = {a_1, a_2, ..., a_i}, where i is the number of attributes. Sample problem, Aunt Lacee's library:

x = {book, Origin of Species, 1500-1900, biology, mint, ...}

Each attribute has a set of allowed values:

a_1 ∈ {book, paperback, parchment, comic}
a_3 ∈ {<1200, 1200-1500, 1500-1900, 1900-1930, 1930-1960, 1960-current}
Naïve Bayes assumption. Assume that the attributes are conditionally independent given the class:

P(x|c_k) = P(a_1, a_2, ..., a_i|c_k) = Π_i P(a_i|c_k)

where any given P(a_i|c_k) is the number of instances in the training set with the same a_i value and target value c_k, divided by the number of instances with target c_k. P(c_k) is the number of instances with target c_k divided by the total number of instances. The final label is determined by naïve Bayes:

c_NB = argmax_{c_k ∈ C} P(c_k) Π_i P(a_i|c_k)
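A naïve Bayes sketch over discrete attributes; the tiny training set is hypothetical data in the spirit of Aunt Lacee's library, and the small probability floor for unseen attribute values is our addition (the lecture's counting estimate would give zero there).

```python
from collections import Counter, defaultdict

# Hypothetical (type, date-range, condition) -> class training examples.
train = [
    (("book", "1900-1930", "mint"),         "keep"),
    (("book", "1960-current", "good"),      "sell"),
    (("paperback", "1960-current", "poor"), "recycle"),
    (("comic", "1930-1960", "mint"),        "keep"),
    (("paperback", "1960-current", "good"), "sell"),
]

class_counts = Counter(c for _, c in train)
attr_counts = defaultdict(Counter)      # (attribute index, class) -> value counts
for attrs, c in train:
    for i, a in enumerate(attrs):
        attr_counts[(i, c)][a] += 1

def classify(attrs):
    """c_NB = argmax_c P(c) * prod_i P(a_i | c); a 1e-6 floor (our addition)
    keeps an unseen attribute value from zeroing the whole product."""
    best, best_p = None, -1.0
    for c, nc in class_counts.items():
        p = nc / len(train)                          # P(c_k)
        for i, a in enumerate(attrs):
            p *= (attr_counts[(i, c)][a] or 1e-6) / nc   # P(a_i | c_k)
        if p > best_p:
            best, best_p = c, p
    return best

label = classify(("book", "1930-1960", "mint"))
```

A mint-condition 1930s book looks most like the "keep" examples, so that class wins the product of probabilities.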
Review.
Discriminant functions: Least Squares; Fisher's linear discriminant; the Perceptron.
Probabilistic: Logistic Regression, with maximum-likelihood error minimization and with the Newton-Raphson approach.
Probabilistic Generative Functions: Gaussian class-conditional probabilities; discrete attribute values with the Naïve Bayes classifier.