Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP and: Computer vision: models, learning and inference. 2011 Simon J.D. Prince
Classification
Example: Gender Classification
Incremental logistic regression, 300 arctan basis functions.
Results: 87.5% correct (humans: 95%).
Regression vs. Classification
Regression: x ∈ [−1, 1], t ∈ [−1, 1]
Classification (two classes): x ∈ [−1, 1], t ∈ {0, 1}
Regression vs. Classification
Linear regression model prediction (y real): y(x) = w^T x + w_0
Classification: we wish the model to predict y in the range (0, 1) (posterior probabilities):
y(x) = f(w^T x + w_0), where f is a nonlinear activation function (generalized linear models).
Decision surface: w^T x + w_0 = constant
Decision Theory
Inference step: determine either the joint p(x, C_k) or the posterior p(C_k|x).
Decision step: for given x, determine the optimal t.
Minimum Misclassification Rate
Minimum Misclassification Rate
We are free to choose the decision rule that assigns each point x to one of the two classes; this defines the decision regions R_k.
Since p(x, C_k) = p(C_k|x) p(x), the misclassification rate is minimized by making the integrand small in each region:
assign x to the class for which the posterior p(C_k|x) is larger!
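A minimal sketch of this decision rule (not from the slides), assuming the posteriors p(C_k|x) have already been computed for a batch of points; the minimum-misclassification-rate decision is simply an argmax over classes:

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for 4 points and 2 classes (rows sum to 1).
posteriors = np.array([[0.90, 0.10],
                       [0.40, 0.60],
                       [0.55, 0.45],
                       [0.20, 0.80]])

# Minimum misclassification rate: assign each x to the class with the largest posterior.
decisions = np.argmax(posteriors, axis=1)
print(decisions)  # [0 1 0 1]
```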
Minimum Expected Loss
Example: classify medical images as cancer or normal.
Loss matrix L_kj (rows: truth, columns: decision):
                  decide cancer   decide normal
  truth cancer          0             1000
  truth normal          1                0
Minimum Expected Loss
Regions R_j are chosen to minimize the expected loss
E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, C_k) dx,
i.e. assign each x to the class j that minimizes Σ_k L_kj p(C_k|x).
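A small sketch of the decision step (illustrative numbers, assuming the cancer/normal loss matrix above and already-computed posteriors): the expected loss of deciding class j at x is Σ_k L_kj p(C_k|x), and we pick the j that minimizes it.

```python
import numpy as np

# Loss matrix L[k, j]: true class k (0 = cancer, 1 = normal), decision j.
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

# Illustrative posteriors p(C_k | x) for a few inputs (rows sum to 1).
posteriors = np.array([[0.01, 0.99],
                       [0.30, 0.70]])

# Expected loss of each decision j: sum_k L[k, j] * p(C_k | x).
expected_loss = posteriors @ L
decisions = np.argmin(expected_loss, axis=1)
print(expected_loss)  # [[0.99, 10.0], [0.7, 300.0]]
print(decisions)      # [0 0] -- even a 30% cancer probability triggers the cancer decision
```

Note how the asymmetric loss shifts the decision boundary: the "decide cancer" option wins even where the cancer posterior is small.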
Reject Option
Avoid making decisions in ambiguous regions: reject x when the largest posterior p(C_k|x) falls below a threshold θ.
Three strategies
1. Model the class-conditional density p(x|C_k) for each class C_k and the prior p(C_k), then use Bayes' theorem:
   p(C_k|x) = p(x|C_k) p(C_k) / p(x)
   Equivalently, model the joint distribution p(x, C_k). (Models of the distribution of inputs and outputs are generative models.)
2. First solve the inference problem of determining the posterior class probabilities p(C_k|x), and then subsequently use decision theory to assign each new x to one of the classes (discriminative models).
3. Find a discriminant function that directly maps x to a class label.
Class-conditional density vs. posterior
[Figure: left, class-conditional densities p(x|C_1) and p(x|C_2); right, the corresponding posterior probabilities p(C_1|x) and p(C_2|x).]
Why Separate Inference and Decision?
- Minimizing risk (the loss matrix may change over time)
- Reject option
- Unbalanced class priors
- Combining models
Several dimensions
Several dimensions
Linear discriminant: y(x) = w^T x + w_0, with weight vector w and bias w_0 (the negative of the bias is sometimes called a threshold).
Assign x to C_1 if y(x) ≥ 0, to C_2 otherwise. Decision surface: y(x) = 0.
[Figure: geometry of the decision regions R_1 (y > 0) and R_2 (y < 0); w is orthogonal to the decision surface.]
Perceptron 1
A linear discriminant model by Rosenblatt (1962):
y(x) = f(w^T φ(x)), with feature vector φ(x) and
f(a) = +1 if a ≥ 0, −1 if a < 0.
Perceptron 2
Perceptron criterion: E_P(w) = −Σ_{n∈M} w^T φ_n t_n, where M is the set of misclassified patterns.
Learning: w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n
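A minimal sketch of this update rule, assuming the identity feature map with a prepended bias feature, φ(x) = [1, x], and targets t_n ∈ {−1, +1}; only misclassified patterns trigger an update (all names below are illustrative):

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, max_epochs=100):
    """X: (N, D) inputs, t: (N,) targets in {-1, +1}. Returns weight vector w."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])  # phi(x) = [1, x]: bias feature
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:          # misclassified (boundary counted as error)
                w = w + eta * phi_n * t_n       # w <- w + eta * phi_n * t_n
                updated = True
        if not updated:                         # converged: no misclassified patterns left
            break
    return w

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = perceptron_train(X, t)
print(np.sign(np.hstack([np.ones((4, 1)), X]) @ w))  # matches t on separable data
```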
Perceptron 3
[Figure 4.7: convergence of the perceptron learning algorithm on toy 2-D data, showing the decision boundary after successive updates.]
Fisher's linear discriminant 1
Projecting data down to one dimension: y = w^T x. But how to choose w?
Fisher's linear discriminant 2
Define class means: m_1 = (1/N_1) Σ_{n∈C_1} x_n,  m_2 = (1/N_2) Σ_{n∈C_2} x_n.
Try to maximize the separation of the projected class means: m_2 − m_1 = w^T(m_2 − m_1).
[Figure 4.6, left panel: projection onto the line joining the class means.]
Fisher's linear discriminant 3
There is still a problem: two classes may be well separated in (x_1, x_2) yet overlap considerably when projected onto the line joining their means, because of strongly nondiagonal class covariances.
Fisher's idea: instead, maximize the ratio of the separation of the projected class means to the within-class variance,
J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2),  with  s_k^2 = Σ_{n∈C_k} (y_n − m_k)^2  and  y_n = w^T x_n.
This is called the Fisher criterion. Maximize it!
Fisher's linear discriminant 4
Fisher criterion: J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)
Rewrite as J(w) = (w^T S_B w) / (w^T S_W w), with
between-class covariance S_B = (m_2 − m_1)(m_2 − m_1)^T and
within-class covariance S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T.
Fisher's linear discriminant 5
J(w) = (w^T S_B w) / (w^T S_W w)
Differentiate with respect to w: (w^T S_B w) S_W w = (w^T S_W w) S_B w.
Since S_B w is proportional to (m_2 − m_1), this gives Fisher's linear discriminant:
w ∝ S_W^{-1} (m_2 − m_1)
Fisher's linear discriminant 6
Fisher's linear discriminant: w ∝ S_W^{-1} (m_2 − m_1).   Fisher criterion: J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2).
[Figure 4.6: projection onto the line joining the class means (left) vs. onto the Fisher direction (right), which separates the classes much better.]
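A small sketch of the final result on toy data (assuming NumPy only; the data and names are illustrative), computing the Fisher direction w ∝ S_W^{-1}(m_2 − m_1) and projecting both classes onto it:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.3], size=(100, 2))  # class C1
X2 = rng.normal(loc=[2.0, 1.5], scale=[1.0, 0.3], size=(100, 2))  # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                  # class means
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)    # within-class covariance
w = np.linalg.solve(S_W, m2 - m1)                          # Fisher direction (up to scale)
w = w / np.linalg.norm(w)

y1, y2 = X1 @ w, X2 @ w                                    # 1-D projections y = w^T x
print(w, y1.mean(), y2.mean())                             # projected class means are well separated
```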
Least squares for classification fails
[Figure: the least-squares decision boundary is strongly influenced by outlying points, unlike the logistic regression boundary.]
Use logistic regression instead!
Probabilistic generative models
Posterior probability for class C_1 can be written
p(C_1|x) = p(x|C_1) p(C_1) / (p(x|C_1) p(C_1) + p(x|C_2) p(C_2)) = 1 / (1 + exp(−a)) = σ(a),
with a = ln[ p(x|C_1) p(C_1) / (p(x|C_2) p(C_2)) ] and the logistic sigmoid function σ(a) = 1 / (1 + exp(−a)).
Logistic sigmoid function
σ(a) = 1 / (1 + exp(−a))
[Figure: plot of σ(a), rising from 0 to 1 over a ∈ [−5, 5].]
Gaussian class-conditional densities (different means, but equal covariances)
p(x|C_k) = (2π)^{−D/2} |Σ|^{−1/2} exp{ −(1/2)(x − μ_k)^T Σ^{-1} (x − μ_k) }
yields p(C_1|x) = σ(w^T x + w_0), with
w = Σ^{-1}(μ_1 − μ_2)
w_0 = −(1/2) μ_1^T Σ^{-1} μ_1 + (1/2) μ_2^T Σ^{-1} μ_2 + ln( p(C_1) / p(C_2) )
Linear function of x in the argument of the logistic sigmoid.
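A small sketch of this result (illustrative parameters, assuming the means, shared covariance, and priors are known): the posterior reduces to a sigmoid of a linear function of x.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative Gaussian parameters with a shared covariance matrix.
mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
p_C1, p_C2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                       # w = Sigma^{-1} (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1                # w0 as on the slide
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(p_C1 / p_C2))

x = np.array([0.5, 0.2])
print(sigmoid(w @ x + w0))                        # posterior p(C1 | x)
```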
Class-conditional densities and the posterior probability p(C_1|x)
[Figure: left, Gaussian class-conditional densities with shared covariance; right, the resulting sigmoidal posterior p(C_1|x).]
Probabilistic discriminative models: Logistic regression, a model of classification
The posterior probability of class C_1 can be written as a logistic sigmoid acting on a linear function of the feature vector φ:
p(C_1|φ) = y(φ) = σ(w^T φ), with σ(a) = 1 / (1 + exp(−a)).
Also: p(C_2|φ) = 1 − p(C_1|φ).
M parameters (vs. M(M+5)/2 + 1 for the generative model).
Maximum likelihood logistic regression (1)
Likelihood: p(t|w) = Π_{n=1}^N y_n^{t_n} (1 − y_n)^{1−t_n}, with t = (t_1, ..., t_N)^T and y_n = p(C_1|φ_n),
for a data set {φ_n, t_n}, where t_n ∈ {0, 1} and φ_n = φ(x_n), n = 1, ..., N.
Error function (negative log-likelihood):
E(w) = −ln p(t|w) = −Σ_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }, with y_n = σ(w^T φ_n).
Maximum likelihood logistic regression (2)
E(w) = −ln p(t|w) = −Σ_{n=1}^N { t_n ln y_n + (1 − t_n) ln(1 − y_n) }
Gradient of the error function with respect to w:
∇E(w) = Σ_{n=1}^N (y_n − t_n) φ_n
Stochastic gradient update rule: w^(τ+1) = w^(τ) − η ∇E_n,
where τ is the iteration number and η is the learning rate parameter.
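A minimal sketch of maximum-likelihood logistic regression trained with the stochastic gradient update above (assuming identity features with a prepended bias, φ_n = [1, x_n], targets t_n ∈ {0, 1}; function names and data are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_sgd(X, t, eta=0.1, epochs=200, seed=0):
    """X: (N, D) inputs, t: (N,) targets in {0, 1}. Returns weights w."""
    rng = np.random.default_rng(seed)
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # phi_n = [1, x_n]
    w = np.zeros(Phi.shape[1])
    N = Phi.shape[0]
    for _ in range(epochs):
        for n in rng.permutation(N):                  # visit patterns in random order
            y_n = sigmoid(w @ Phi[n])                 # y_n = sigma(w^T phi_n)
            w -= eta * (y_n - t[n]) * Phi[n]          # grad of E_n is (y_n - t_n) phi_n
    return w

# Toy data.
X = np.array([[0.5, 1.0], [1.0, 0.8], [-0.7, -1.0], [-1.2, -0.3]])
t = np.array([1, 1, 0, 0])
w = logistic_sgd(X, t)
print(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w))  # fitted posteriors p(C1 | phi_n)
```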
Example
[Figure: example data shown in the original input space (x_1, x_2) (left) and after transformation to the feature space (φ_1, φ_2) (right), where a linear decision boundary suffices.]
Logistic regression
Two parameters
Learning by standard methods (ML, MAP, Bayesian)
Inference: just evaluate Pr(w|x)
Neater Notation
To make the notation easier to handle, we
- attach a 1 to the start of every data vector x,
- attach the offset to the start of the parameter vector φ.
New model: Pr(w = 1|x) = sig[φ^T x]
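A one-line sketch of this bookkeeping (hypothetical array names): prepend a 1 to each data vector so the offset becomes the first entry of the parameter vector.

```python
import numpy as np

X = np.array([[0.2, 1.5], [3.0, -0.4]])            # original data vectors x_n
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a 1: offset absorbed into phi
print(X_aug)                                       # [[1. 0.2 1.5], [1. 3. -0.4]]
```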
Logistic regression