Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING, CHRISTOPHER M. BISHOP, and: Computer vision: models, learning and inference, Simon J.D. Prince


Brittney Chambers
6 years ago

1 Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP and: Computer vision: models, learning and inference Simon J.D. Prince

2 Classification

3 Example: Gender Classification. Incremental logistic regression with 300 arc tan basis functions. Results: 87.5% (humans = 95%). Computer vision: models, learning and inference, Simon J.D. Prince

4 Regression vs. Classification. Regression: $x \in [-1, 1]$, $t \in [-1, 1]$. Classification (two classes): $x \in [-1, 1]$, $t \in \{0, 1\}$.

5 Regression vs. Classification. Linear regression: the model prediction $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ is real-valued. Classification: we wish the model to predict $y$ in the range $(0, 1)$ (posterior probabilities), so $y(\mathbf{x}) = f(\mathbf{w}^T \mathbf{x} + w_0)$, where $f(\cdot)$ is a nonlinear activation function (generalized linear models). Decision surface: $\mathbf{w}^T \mathbf{x} + w_0 = \text{constant}$, which is linear in $\mathbf{x}$ even if the function $f(\cdot)$ is nonlinear.

6 Decision Theory. Inference step: determine either $p(t \mid \mathbf{x})$ or $p(\mathbf{x}, t)$. Decision step: for given $\mathbf{x}$, determine optimal $t$.

7 Minimum Misclassification Rate. We are free to choose the decision rule that assigns each point $\mathbf{x}$ to one of the two classes. This defines the decision regions $R_k$.

8 Minimum Misclassification Rate. To minimize the integrand, use $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x})$: assign $\mathbf{x}$ to the class for which the posterior $p(C_k \mid \mathbf{x})$ is larger!
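The integrand referred to here belongs to the probability of a mistake. As a sketch in the notation of Bishop's textbook (the label "mistake" and the two-class restriction are taken from there, not from the slide text), it can be written as:

```latex
p(\text{mistake})
  = p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1)
  = \int_{R_1} p(\mathbf{x}, C_2)\, d\mathbf{x}
  + \int_{R_2} p(\mathbf{x}, C_1)\, d\mathbf{x}
```

Each point $\mathbf{x}$ contributes the joint density of the class it is not assigned to, so the sum is smallest when $\mathbf{x}$ goes to $R_1$ exactly where $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$. Because $p(\mathbf{x})$ is common to both terms, this is the same as choosing the class with the larger posterior.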

9 Minimum Expected Loss. Example: classify medical images as cancer or normal. [Loss matrix indexed by truth and decision.]

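10 Minimum Expected Loss. Regions $R_j$ are chosen to minimize the expected loss $\mathbb{E}[L] = \sum_k \sum_j \int_{R_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}$, where $L_{kj}$ is the loss incurred when an input of class $C_k$ is assigned to class $C_j$.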
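For a single input, minimizing the expected loss amounts to choosing the decision $j$ with the smallest $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$. The sketch below illustrates this for the cancer example. The loss values and the posterior are illustrative assumptions, and the function name is hypothetical.

```python
def min_expected_loss_decision(posteriors, loss):
    """Return (best decision j, expected loss of every decision).

    posteriors[k] is p(C_k | x); loss[k][j] is the loss for deciding j
    when the truth is k.
    """
    num_classes = len(posteriors)
    num_decisions = len(loss[0])
    expected = [
        sum(loss[k][j] * posteriors[k] for k in range(num_classes))
        for j in range(num_decisions)
    ]
    best = min(range(num_decisions), key=lambda j: expected[j])
    return best, expected


# Assumed, illustrative loss matrix: rows = truth, columns = decision.
# Index 0 = cancer, index 1 = normal. Missing a cancer is penalized heavily.
LOSS = [[0.0, 1000.0],
        [1.0, 0.0]]

# Assumed posterior: the image is 99% likely to be normal.
decision, expected = min_expected_loss_decision([0.01, 0.99], LOSS)
```

With these assumed numbers the rule still decides "cancer" (index 0), because the expected loss of deciding "normal" is dominated by the large penalty for a missed cancer.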

11 Reject Option
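One common form of the reject option can be sketched as follows. It declines to classify whenever the largest posterior falls below a threshold. The threshold name `theta` and the exact rule are assumptions following Bishop's textbook, since the slide body is not preserved here.

```python
def classify_with_reject(posteriors, theta):
    """Return the index of the most probable class, or None to reject.

    posteriors[k] is p(C_k | x); theta is the assumed rejection threshold.
    """
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    return best if posteriors[best] >= theta else None
```

For example, `classify_with_reject([0.55, 0.45], 0.8)` rejects the input, while `classify_with_reject([0.95, 0.05], 0.8)` returns class index 0.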

12 Three strategies
1. Model the class-conditional density $p(\mathbf{x} \mid C_k)$ for each class $C_k$ and the prior $p(C_k)$, then use Bayes' theorem: $p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}$. Equivalently, model the joint distribution $p(\mathbf{x}, C_k)$. (Models of the distribution of inputs and outputs are generative models.)
2. First solve the inference problem of determining the posterior class probabilities $p(C_k \mid \mathbf{x})$, then use decision theory to assign each new $\mathbf{x}$ to one of the classes (discriminative models).
3. Find a discriminant function that maps $\mathbf{x}$ directly to a class label.

13 Class-conditional density vs. posterior. [Figure: left panel, class-conditional densities $p(x \mid C_1)$ and $p(x \mid C_2)$ plotted against $x$; right panel, posterior probabilities $p(C_1 \mid x)$ and $p(C_2 \mid x)$ plotted against $x$.]

14 Why Separate Inference and Decision? Minimizing risk (loss matrix may change over time); reject option; unbalanced class priors; combining models.

15 Several dimensions

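16 Several dimensions. Linear discriminant: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$, with weight vector $\mathbf{w}$ and bias $w_0$ (the negative of the bias is called a threshold). Assign $\mathbf{x}$ to $C_1$ if $y(\mathbf{x}) \ge 0$ and to $C_2$ otherwise. Decision surface: $y(\mathbf{x}) = 0$. [Figure: decision surface in the $(x_1, x_2)$ plane separating region $R_1$ ($y > 0$) from region $R_2$ ($y < 0$); $\mathbf{w}$ is orthogonal to the surface, the surface lies at distance $-w_0 / \|\mathbf{w}\|$ from the origin, and $y(\mathbf{x}) / \|\mathbf{w}\|$ is the signed distance of $\mathbf{x}$ from the surface.]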
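The geometry of the linear discriminant $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$ can be checked with a few lines of code. This is a minimal sketch. The function names are hypothetical, and the class labels 1 and 2 stand for $C_1$ and $C_2$.

```python
import math


def discriminant(w, w0, x):
    """Evaluate y(x) = w^T x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def classify(w, w0, x):
    """Return 1 for class C1 (y(x) >= 0), otherwise 2 for class C2."""
    return 1 if discriminant(w, w0, x) >= 0 else 2


def signed_distance(w, w0, x):
    """Signed perpendicular distance of x from the surface y(x) = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return discriminant(w, w0, x) / norm
```

With the assumed values `w = [3.0, 4.0]` and `w0 = -5.0`, the origin lies on the $C_2$ side at signed distance $w_0 / \|\mathbf{w}\| = -1$, so the surface itself is at distance 1 from the origin.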

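17 Perceptron 1. A linear discriminant model by Rosenblatt (1962): $y(\mathbf{x}) = f(\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}))$ with $f(a) = +1$ for $a \ge 0$ and $f(a) = -1$ for $a < 0$, and feature vector $\boldsymbol{\phi}(\mathbf{x})$.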

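18 Perceptron 2. Perceptron criterion: $E_P(\mathbf{w}) = -\sum_{n \in M} \mathbf{w}^T \boldsymbol{\phi}_n t_n$, where $M$ is the set of misclassified patterns and the targets are $t_n \in \{-1, +1\}$. Learning: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta \boldsymbol{\phi}_n t_n$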
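The perceptron update can be sketched as a short training loop. The version below assumes targets in {-1, +1}, feature vectors that already include a constant 1 for the bias, and an epoch cap in case the data are not linearly separable. The function names and the toy data are illustrative.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron_train(features, targets, eta=1.0, max_epochs=100):
    """Cycle through the patterns, applying w <- w + eta * phi_n * t_n
    to every pattern that is currently misclassified (w^T phi_n t_n <= 0)."""
    w = [0.0] * len(features[0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, t in zip(features, targets):
            if dot(w, phi) * t <= 0:
                w = [wi + eta * t * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:  # every pattern is on the correct side
            break
    return w


def perceptron_predict(w, phi):
    """f(a) = +1 if a >= 0, else -1."""
    return 1 if dot(w, phi) >= 0 else -1


# Assumed toy data: first component is the constant bias feature.
FEATURES = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
TARGETS = [1, 1, -1, -1]
weights = perceptron_train(FEATURES, TARGETS)
```

On linearly separable data such as this toy set, the loop stops as soon as a full pass produces no updates.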

19 Perceptron Figure 4.7

20 Fisher's linear discriminant 1. Projecting data down to one dimension: $y = \mathbf{w}^T \mathbf{x}$. But how?

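21 Fisher's linear discriminant 2. Define the class means $\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} \mathbf{x}_n$ and $\mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} \mathbf{x}_n$. Try: maximize the separation of the projected class means, $m_2 - m_1 = \mathbf{w}^T (\mathbf{m}_2 - \mathbf{m}_1)$, where $m_k = \mathbf{w}^T \mathbf{m}_k$. [Figure]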

22 Fisher's linear discriminant 3. Maximizing the separation of the projected means alone can leave considerable overlap between the projected classes when the class covariances are strongly nondiagonal (Figure 4.6). Instead, consider the ratio of between-class variance to within-class variance: $J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$, with $s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$ and $y_n = \mathbf{w}^T \mathbf{x}_n$. This is called the Fisher criterion. Maximize it!

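23 Fisher's linear discriminant 4. Fisher criterion: $J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$. Rewrite: $J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$. Between-class cov.: $\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$. Within-class cov.: $\mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$.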

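24 Fisher's linear discriminant 5. $J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}$. Differentiate with respect to $\mathbf{w}$: $(\mathbf{w}^T \mathbf{S}_B \mathbf{w})\, \mathbf{S}_W \mathbf{w} = (\mathbf{w}^T \mathbf{S}_W \mathbf{w})\, \mathbf{S}_B \mathbf{w}$. $\mathbf{S}_B \mathbf{w}$ is proportional to $(\mathbf{m}_2 - \mathbf{m}_1)$. Thus (Fisher's linear discriminant): $\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$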
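The closed-form direction $\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$ is easy to compute directly. The sketch below is restricted to two-dimensional inputs so that the matrix inverse can be written out by hand. The helper names and the toy data are assumptions made for illustration.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]


def scatter2(points, m):
    """2x2 scatter matrix sum_n (x_n - m)(x_n - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class1, class2):
    """Return S_W^{-1} (m2 - m1) for 2-D data (S_W must be invertible)."""
    m1, m2 = mean2(class1), mean2(class2)
    s1, s2 = scatter2(class1, m1), scatter2(class2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    d = [m2[0] - m1[0], m2[1] - m1[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1],
            inv[1][0] * d[0] + inv[1][1] * d[1]]


def project(w, points):
    """Project each point onto w: y_n = w^T x_n."""
    return [w[0] * p[0] + w[1] * p[1] for p in points]


# Assumed toy data: two elongated, correlated classes offset vertically.
CLASS1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0]]
CLASS2 = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0]]
w_fisher = fisher_direction(CLASS1, CLASS2)
```

On this toy data, projecting onto the plain mean difference leaves the two classes overlapping, whereas projecting onto `w_fisher` separates them, which is the effect the within-class term is meant to achieve.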

25 Fisher's linear discriminant 6. Fisher's linear discriminant: $\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$. Fisher criterion: $J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$. (Figure 4.6)

26 Least squares for classification fails. Use logistic regression instead!

27 Probabilistic generative models. The posterior probability for class $C_1$ can be written $p(C_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_1)\, p(C_1) + p(\mathbf{x} \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$, with $a = \ln \frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)}$ and the logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$.
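The identity between the Bayes posterior and the sigmoid of the log odds can be verified numerically. The following sketch uses hypothetical function names and arbitrary likelihood and prior values.

```python
import math


def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def posterior_bayes(lik1, prior1, lik2, prior2):
    """p(C1 | x) computed directly from Bayes' theorem."""
    return lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)


def posterior_sigmoid(lik1, prior1, lik2, prior2):
    """p(C1 | x) computed as sigma(a), with a the log odds."""
    a = math.log((lik1 * prior1) / (lik2 * prior2))
    return sigmoid(a)
```

For any positive likelihoods and priors the two functions agree up to floating-point rounding, for example with `lik1=0.2, prior1=0.3, lik2=0.5, prior2=0.7`.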

28 Logistic sigmoid function: $\sigma(a) = \frac{1}{1 + \exp(-a)}$

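29 Gaussian class-conditional densities (different means, but equal covariances): $p(\mathbf{x} \mid C_k) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) \right\}$. Yields $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0)$ with $\mathbf{w} = \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $w_0 = -\frac{1}{2} \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 + \frac{1}{2} \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 + \ln \frac{p(C_1)}{p(C_2)}$. Linear function of $\mathbf{x}$ in the argument of the logistic sigmoid.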
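In one dimension the shared covariance reduces to a single variance, and the result can be checked against a direct application of Bayes' theorem. The sketch below makes that one-dimensional assumption. The function names and the parameter values in the usage note are illustrative.

```python
import math


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def linear_params(mu1, mu2, var, prior1, prior2):
    """1-D versions of w = Sigma^{-1}(mu1 - mu2) and the matching w0."""
    w = (mu1 - mu2) / var
    w0 = (-0.5 * mu1 ** 2 / var + 0.5 * mu2 ** 2 / var
          + math.log(prior1 / prior2))
    return w, w0


def posterior_linear(x, mu1, mu2, var, prior1, prior2):
    """p(C1 | x) as sigma(w x + w0)."""
    w, w0 = linear_params(mu1, mu2, var, prior1, prior2)
    return 1.0 / (1.0 + math.exp(-(w * x + w0)))


def posterior_direct(x, mu1, mu2, var, prior1, prior2):
    """p(C1 | x) from Bayes' theorem with Gaussian class-conditionals."""
    j1 = gaussian_pdf(x, mu1, var) * prior1
    j2 = gaussian_pdf(x, mu2, var) * prior2
    return j1 / (j1 + j2)
```

For instance, with `mu1=1.0, mu2=-1.0, var=2.0, prior1=0.4, prior2=0.6` the two posteriors agree at every `x`, because the quadratic terms in `x` cancel when the variance is shared.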

30 [Figure: left, class-conditional densities; right, posterior probability $p(C_1 \mid \mathbf{x})$.]

31 Probabilistic discriminative models: logistic regression, a model of classification. The posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\boldsymbol{\phi}$: $p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T \boldsymbol{\phi})$, where $\sigma(\cdot)$ is the logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$. Also: $p(C_2 \mid \boldsymbol{\phi}) = 1 - p(C_1 \mid \boldsymbol{\phi})$. $M$ parameters (compared with $M(M+5)/2 + 1$ for the generative model).

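32 Maximum likelihood logistic regression (1). For a data set $\{\boldsymbol{\phi}_n, t_n\}$, where $t_n \in \{0, 1\}$ and $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$, with $n = 1, \ldots, N$, the likelihood is $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$, with $\mathbf{t} = (t_1, \ldots, t_N)^T$ and $y_n = p(C_1 \mid \boldsymbol{\phi}_n)$. Error function, obtained by taking the negative logarithm of the likelihood: $E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$ with $y_n = \sigma(\mathbf{w}^T \boldsymbol{\phi}_n)$.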

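33 Maximum likelihood logistic regression (2). $E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$. Gradient of the error function with respect to $\mathbf{w}$: $\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n$. Stochastic gradient update rule: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n$, where $\tau$ is the iteration number and $\eta$ is a learning rate parameter.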
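The stochastic gradient rule can be sketched in a few lines, using the per-pattern gradient $\nabla E_n = (y_n - t_n) \boldsymbol{\phi}_n$, which is one term of the full gradient sum. The learning rate, epoch count, and toy data below are assumptions chosen for illustration. Feature vectors are assumed to include a constant 1 for the bias.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def sigmoid(a):
    """Numerically stable logistic sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def cross_entropy_error(w, features, targets):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    total = 0.0
    for phi, t in zip(features, targets):
        y = sigmoid(dot(w, phi))
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total


def sgd_logistic(features, targets, eta=0.1, epochs=200):
    """Apply w <- w - eta * (y_n - t_n) * phi_n, one pattern at a time."""
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        for phi, t in zip(features, targets):
            y = sigmoid(dot(w, phi))
            w = [wi - eta * (y - t) * pi for wi, pi in zip(w, phi)]
    return w


# Assumed toy data: [1, x] features with overlapping classes, t_n in {0, 1}.
FEATURES = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
TARGETS = [0, 0, 1, 0, 1, 1]
w_fit = sgd_logistic(FEATURES, TARGETS)
```

The toy classes overlap on purpose: on linearly separable data the maximum likelihood weights grow without bound, so a fixed number of epochs would only give an arbitrary stopping point.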

34 Example. [Figure: data shown in the original input space $(x_1, x_2)$ and in the feature space $(\phi_1, \phi_2)$.]

35 Logistic regression

36 Two parameters. Learning by standard methods (ML, MAP, Bayesian). Inference: just evaluate $Pr(w \mid \mathbf{x})$.

37 Neater Notation. To make notation easier to handle, we attach a 1 to the start of every data vector and attach the offset to the start of the gradient vector $\boldsymbol{\phi}$. New model:
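The effect of this change of notation can be sketched in code. The sketch assumes, as in Prince's textbook, that the model applies a sigmoid to $\phi_0 + \boldsymbol{\phi}^T \mathbf{x}$, with offset $\phi_0$ and gradient vector $\boldsymbol{\phi}$. Under that assumption the augmented form is a sigmoid of a single inner product. The function names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def augment_data(x):
    """Attach a 1 to the start of a data vector."""
    return [1.0] + list(x)


def augment_params(phi0, phi):
    """Attach the offset to the start of the gradient vector."""
    return [phi0] + list(phi)


def model_with_offset(phi0, phi, x):
    """sig(phi0 + phi^T x), with the offset kept separate."""
    return sigmoid(phi0 + sum(p * xi for p, xi in zip(phi, x)))


def model_augmented(phi_aug, x_aug):
    """sig(phi^T x) on the augmented vectors."""
    return sigmoid(sum(p * xi for p, xi in zip(phi_aug, x_aug)))
```

Both forms give the same probability for any input, so the augmentation changes only the bookkeeping, not the model.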

38 Logistic regression


Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February to June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") Part VII