4. Linear Classification. Kai Yu


1 4. Linear Classification. Kai Yu

2 Linear Classifiers: the simplest classification model; helps to understand nonlinear models; arguably the most useful classification method!

3 Linear Classifiers: the simplest classification model; helps to understand nonlinear models; arguably the most useful classification method!

4 Outline: Perceptron Algorithm; Support Vector Machines; Logistic Regression; Summary

5 Basic Neuron

6 Perceptron Node (Threshold Logic Unit): inputs x_1, ..., x_n with weights w_1, ..., w_n, threshold b, and output z, where z = 1 if Σ_{i=1}^n x_i w_i ≥ b and z = 0 if Σ_{i=1}^n x_i w_i < b
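As a concrete illustration, here is a minimal sketch of the threshold logic unit in Python (the function name, the example weights, and the assumed threshold b = 0 are mine, not the slides'):

```python
import numpy as np

def tlu(x, w, b):
    """Threshold logic unit: output 1 if the weighted input sum reaches threshold b."""
    net = np.dot(x, w)          # sum_i x_i * w_i
    return 1 if net >= b else 0

# Hypothetical example: weights (.4, -.2) and an assumed threshold b = 0
print(tlu(np.array([0.8, 0.3]), np.array([0.4, -0.2]), 0.0))  # net = .26 -> 1
```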

7 Perceptron Learning Algorithm (figure: training table with features x_1, x_2 and target t), with z = 1 if Σ_{i=1}^n x_i w_i ≥ b and z = 0 if Σ_{i=1}^n x_i w_i < b

8 First Training Instance: z = net = .8·.4 + .3·(−.2) = .26 (training table with x_1, x_2, t; same threshold rule as before)

9 Second Training Instance: z = net = .4·.4 + .1·(−.2) = .14 (same threshold rule); weight update Δw_i = (t − z) · c · x_i

10 The Perceptron Learning Rule: Δw_ij = c (t_j − z_j) x_i. Least perturbation principle: only change the weights if there is an error (learning from mistakes); the change amounts to x_i scaled by a small constant c. Iteratively apply examples from the training set; each pass through the training set is an epoch. Continue training until the total training error is less than epsilon. Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists.
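A minimal sketch of this learning rule as a full training loop in NumPy (variable names are mine; the threshold b is trained by the same rule, treated as a bias on a constant input, and c is the learning rate):

```python
import numpy as np

def train_perceptron(X, t, c=0.1, max_epochs=100):
    """Perceptron learning: delta w_i = c * (t - z) * x_i, applied example by example.

    X: (R, d) inputs, t: (R,) targets in {0, 1}. Returns (w, b).
    """
    R, d = X.shape
    w, b = np.zeros(d), 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, t):
            z = 1 if np.dot(w, x_i) >= b else 0   # threshold logic unit
            if z != t_i:                          # least perturbation: update only on error
                w += c * (t_i - z) * x_i
                b -= c * (t_i - z)                # threshold moves opposite to the weights
                errors += 1
        if errors == 0:                           # epoch with zero training error: converged
            break
    return w, b
```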

11 Outline: Perceptron Algorithm; Support Vector Machines; Logistic Regression; Summary

12 Support Vector Machines: Overview. A powerful method for 2-class classification. Original idea: Vapnik, 1965, for linear classifiers; the SVM: Cortes and Vapnik, 1995; became very hot since the late 90s. Better generalization (less overfitting). Key ideas: use a kernel function to transform low-dimensional training samples to a higher dimension (to handle the linear-separability problem); use quadratic programming (QP) to find the best classifier boundary hyperplane (for a global optimum).

13 Linear Classifiers (figure: data points, legend: denotes +1 / denotes −1; x is mapped by f to y_est): f(x, w, b) = sign(w·x − b). How would you classify this data?

14 Linear Classifiers (figure: another candidate boundary): f(x, w, b) = sign(w·x − b). How would you classify this data?

15 Linear Classifiers (figure: another candidate boundary): f(x, w, b) = sign(w·x − b). How would you classify this data?

16 Linear Classifiers (figure: another candidate boundary): f(x, w, b) = sign(w·x − b). How would you classify this data?

17 Linear Classifiers (figure): f(x, w, b) = sign(w·x − b). Any of these would be fine... but which is best?

18 Classifier Margin (figure; legend: denotes +1 / denotes −1): f(x, w, b) = sign(w·x − b). Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

19 Maximum Margin (figure; legend: denotes +1 / denotes −1). Linear SVM: f(x, w, b) = sign(w·x − b). The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).

20 Maximum Margin (figure). Support vectors are those datapoints that the margin pushes up against. Linear SVM: f(x, w, b) = sign(w·x − b). The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM).

21 Computing the margin width: M = margin width. How do we compute M in terms of w and b? Plus-plane = { x : w·x + b = +1 }; Minus-plane = { x : w·x + b = −1 }; margin M = 2 / √(w·w).
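The 2/√(w·w) formula follows from a short derivation that the slide leaves implicit; a sketch, using the plus- and minus-planes defined above:

```latex
% Take x^- on the minus-plane and x^+ = x^- + \lambda w, the nearest point
% on the plus-plane (both planes share the normal vector w).
w \cdot x^+ + b = +1, \quad w \cdot x^- + b = -1
\;\Rightarrow\; \lambda\,(w \cdot w) = 2
\;\Rightarrow\; \lambda = 2 / (w \cdot w),
\qquad
M = \|x^+ - x^-\| = \lambda \|w\| = \frac{2}{\sqrt{w \cdot w}}.
```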

22 Why Maximum Margin? (figure; legend: denotes +1 / denotes −1; support vectors are those datapoints that the margin pushes up against) f(x, w, b) = sign(w·x − b). The maximum margin linear classifier is the linear classifier with the, um, maximum margin. 1. Intuitively this feels safest. 2. If we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification. 3. It also helps generalization. 4. There's some theory that this is a good thing. 5. Empirically it works very, very well. This is the simplest kind of SVM (called an LSVM).

23 Another way to understand max margin: for f(x) = w·x + b, one way to characterize the smoothness of f(x) is ∂f(x)/∂x = w. Therefore, the margin (which grows as ||w|| shrinks) measures the smoothness of f(x). As a rule of thumb, machine learning prefers smooth functions: similar x's should have similar y's.

24 Learning the Maximum Margin Classifier: M = margin width = 2 / √(w·w). Given a guess of w and b we can: compute whether all data points are in the correct half-planes; compute the width of the margin. So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?

25 Learning via Quadratic Programming: QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints. Minimize both w·w (to maximize M) and the misclassification error.

26 Quadratic Programming. Find arg max over u of the quadratic criterion c + dᵀu + uᵀRu/2, subject to n additional linear inequality constraints
a_11 u_1 + a_12 u_2 + ... + a_1m u_m ≤ b_1
a_21 u_1 + a_22 u_2 + ... + a_2m u_m ≤ b_2
...
a_n1 u_1 + a_n2 u_2 + ... + a_nm u_m ≤ b_n
and e additional linear equality constraints
a_(n+1)1 u_1 + ... + a_(n+1)m u_m = b_(n+1)
...
a_(n+e)1 u_1 + ... + a_(n+e)m u_m = b_(n+e)

27 Learning without Noise. What should our quadratic optimization criterion be? Minimize ½ w·w. What should the constraints be? w·x_k + b ≥ +1 if y_k = +1; w·x_k + b ≤ −1 if y_k = −1.

28 Solving the Optimization Problem. Find w and b such that Φ(w) = wᵀw is minimized and for all (x_i, y_i), i = 1..n: y_i(wᵀx_i + b) ≥ 1. We need to optimize a quadratic function subject to linear constraints. Quadratic optimization problems are a well-known class of mathematical programming problems for which several (non-trivial) algorithms exist. The solution involves constructing a dual problem, where a Lagrange multiplier α_i is associated with every inequality constraint in the primal (original) problem: find α_1 ... α_n such that Q(α) = Σα_i − ½ ΣΣ α_i α_j y_i y_j x_iᵀx_j is maximized and (1) Σα_i y_i = 0, (2) α_i ≥ 0 for all α_i.
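Since the primal is a standard QP, any off-the-shelf solver applies. A sketch using cvxpy (my choice of library; the slides do not prescribe one):

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve: minimize w^T w  s.t.  y_i (w^T x_i + b) >= 1 for all i.

    X: (n, d) array, y: (n,) array with entries +1/-1. Assumes separable data.
    """
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # one inequality per training point
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    prob.solve()
    return w.value, b.value
```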

29 Soft Margin Classification. What if the training set is not linearly separable? Slack variables ε_i can be added to allow misclassification of difficult or noisy examples, resulting in the so-called soft margin. (figure: misclassified points with their slacks ε_i)

30 Learning Maximum Margin with Noise (figure: slack variables ε_2, ε_11, ε_7): M = 2 / √(w·w). Given a guess of w, b we can: compute the sum of distances of points to their correct zones; compute the margin width. Assume R data points, each (x_k, y_k) where y_k = +/−1. What should our quadratic optimization criterion be? Minimize ½ w·w + C Σ_{k=1}^R ε_k, where ε_k = distance of error points to their correct place. How many constraints will we have? R. What should they be? w·x_k + b ≥ +1 − ε_k if y_k = +1; w·x_k + b ≤ −1 + ε_k if y_k = −1; ε_k ≥ 0 for all k.

31 Soft Margin Classification Mathematically. The old formulation: find w and b such that Φ(w) = wᵀw is minimized and for all (x_i, y_i), i = 1..n: y_i(wᵀx_i + b) ≥ 1. The modified formulation incorporates slack variables: find w and b such that Φ(w) = wᵀw + C Σξ_i is minimized and for all (x_i, y_i), i = 1..n: y_i(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0. Parameter C can be viewed as a way to control overfitting: it trades off the relative importance of maximizing the margin and fitting the training data.
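The modified formulation is again a QP; a sketch extending the earlier cvxpy example with slack variables (still my choice of solver, not the slides'):

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Solve: minimize w^T w + C * sum(xi)
    s.t. y_i (w^T x_i + b) >= 1 - xi_i and xi_i >= 0 for all i."""
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n)                              # one slack per training point
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi)), constraints)
    prob.solve()
    return w.value, b.value
```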

32 Hinge loss. The soft margin SVM is equivalent to applying a hinge loss: L(w, b) := Σ_{i=1}^n max(1 − y_i(wᵀx_i + b), 0). Equivalent unconstrained optimization formulation: min_{w,b} L(w, b) + λ||w||², with λ = 0.5/C. (figure: hinge loss and 0-1 loss plotted against y·(w·x + b))
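The unconstrained hinge-loss form suggests a simple subgradient-descent implementation; a minimal sketch (step size, iteration count, and initialization are my illustrative choices):

```python
import numpy as np

def svm_subgradient(X, y, lam=0.1, lr=0.01, epochs=200):
    """Minimize sum_i max(1 - y_i (w^T x_i + b), 0) + lam * ||w||^2 by subgradient descent.

    X: (n, d) array, y: (n,) array with entries +1/-1.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # points inside the margin: hinge subgradient is -y_i x_i
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```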

33 Outline: Perceptron Algorithm; Support Vector Machines; Logistic Regression; Summary

34 Logistic Regression. Binary response: Y = {+1, −1}, where p_i is the probability of Y_i = +1. Likelihood: Y_i | X_i ~ Bernoulli(p_i), with p_i = 1 / (1 + exp(−Wᵀ X_i)), so ∏_{i=1}^n P(Y_i | X_i) = ∏_{i=1}^n 1 / (1 + exp(−Y_i X_iᵀ W)).

35 Logistic regression. The maximum likelihood estimator (MLE) becomes logistic regression: min_W −Σ_{i=1}^n ln p(y_i | X_i) = Σ_{i=1}^n ln(1 + exp(−y_i X_iᵀ W)), a convex optimization problem in terms of W. The MAP estimator is regularized logistic regression: min_W Σ_{i=1}^n ln(1 + exp(−y_i X_iᵀ W)) + λ||W||².
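Because this negative log-likelihood is convex and differentiable, plain gradient descent works; a minimal sketch of (regularized) logistic regression under the slide's {+1, −1} label convention (step size and iteration count are my illustrative choices):

```python
import numpy as np

def logistic_regression(X, y, lam=0.0, lr=0.1, epochs=500):
    """Minimize sum_i ln(1 + exp(-y_i x_i^T W)) + lam * ||W||^2 by gradient descent.

    y in {+1, -1}; no intercept (append a constant-1 feature to X if one is wanted).
    """
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(epochs):
        s = 1.0 / (1.0 + np.exp(y * (X @ W)))        # sigma(-y_i x_i^T W)
        grad = -(X * (y * s)[:, None]).sum(axis=0) + 2 * lam * W
        W -= lr * grad
    return W

# Predicted probability of label +1 for a point x:  p = 1 / (1 + exp(-x^T W))
```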

36 Outline: Perceptron Algorithm; Support Vector Machines; Logistic Regression; Summary

37 General formulation of linear classifiers: min_{w,b} L(w, b) + λ||w||². The objective: empirical loss + regularization. The regularization term is usually the L2 norm, but also often the L1 norm for sparse models. The empirical loss can be hinge loss, logistic loss, smooth hinge loss, or your own invention.
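This plug-in view is exactly how library implementations are parameterized. For instance, a sketch with scikit-learn's SGDClassifier (my choice of library; the parameter names are sklearn's, not the slides'):

```python
from sklearn.linear_model import SGDClassifier

# Same optimization template, different plug-in losses and regularizers:
svm_like = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4)        # linear-SVM objective
logreg_like = SGDClassifier(loss="log_loss", penalty="l1", alpha=1e-4)  # sparse logistic regression
# ("log_loss" is the name in recent scikit-learn releases; older ones use "log".)
```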

38 Different loss functions (figure: 0-1 classification error, SVM hinge loss, logistic regression loss, least squares loss, exponential loss, and modified Huber loss, each plotted as a function of f(x)·y)

39 Comments on linear classifiers. Choosing logistic regression or SVM? Not that big a difference! Logistic regression outputs probabilities. Smooth loss functions, e.g. logistic, are easier to optimize (L-BFGS, ...). Hinge → differentiable hinge, and then you can easily have your own implementation of SVMs. Try different loss functions and regularization terms, depending on the data: many outliers? Irrelevant features? Structure in the output?

40 Linear classifiers in practice and research. Linear classifiers are simple and scalable: training complexity is linear or sub-linear w.r.t. data size, and classification is simple to implement. State-of-the-art for texts, images, ... Their success depends on the quality of the features. A useful practice: use a lot of features, learn a sparse w. Hot topic: large-scale linear classification: many data, many features, many classes; stochastic optimization; parallel implementation.
