Intelligent Systems I 08 SVM


1 Intelligent Systems I 08 SVM
Stefan Harmeling & Philipp Hennig
12 December 2013
Max Planck Institute for Intelligent Systems, Dptmt. of Empirical Inference
1 / 30

2 Your feedback
Enjoyed most:
- Laplace approximation
- getting away from Bayes
- intuition, geometry of SVM
- explanations on the board
Enjoyed least:
- Laplace approximation
- no break
- too much text on the slides, most of it lacking sync with Philipp's explanations
- too fast over the normal/non-normal stuff of regression/classification
2 / 30

3 Support Vector Machine (1)
Classification problem: given patterns $x_1, \dots, x_N \in \mathbb{R}^M$ and class labels $y_1, \dots, y_N \in \{+1, -1\}$, find a rule that predicts the label $y$ of a new pattern $x$.
We consider three cases:
1. Linearly separable case
2. Linearly non-separable case
3. Non-linear case (kernel trick)
3 / 30

4 Support Vector Machine (2): separable case (Schölkopf/Smola 1.4)
Classification problem: given patterns $x_1, \dots, x_N \in \mathbb{R}^M$ and class labels $y_1, \dots, y_N \in \{+1, -1\}$, find a rule that predicts the label $y$ of a new pattern $x$.
Consider the class of hyperplanes:
$\langle w, x \rangle + b = w^T x + b = 0$ where $w \in \mathbb{R}^M$, $b \in \mathbb{R}$
Decision functions based on the hyperplanes:
$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) = \begin{cases} +1 & \text{if } \langle w, x \rangle + b \ge 0 \\ -1 & \text{otherwise} \end{cases}$
4 / 30
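To make the decision rule concrete, here is a minimal numpy sketch of $f(x) = \operatorname{sgn}(\langle w, x \rangle + b)$; the particular $w$ and $b$ are arbitrary values chosen for illustration, not from the lecture.

```python
# Minimal sketch of the hyperplane decision function; w and b are arbitrary.
import numpy as np

def decision(x, w, b):
    """f(x) = +1 if <w, x> + b >= 0, else -1 (the slide's sign convention)."""
    return 1 if w @ x + b >= 0 else -1

w, b = np.array([1.0, -2.0]), 0.5
print(decision(np.array([3.0, 1.0]), w, b))   # <w,x>+b = 1.5  -> +1
print(decision(np.array([0.0, 1.0]), w, b))   # <w,x>+b = -1.5 -> -1
```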

5 Support Vector Machine (3): separable case (Schölkopf/Smola 1.4; Rasmussen/Williams 6.4)
Functional margin of a single example: $\tilde\gamma_i = y_i(\langle w, x_i \rangle + b)$
- $\tilde\gamma_i > 0$ iff $f(x_i) = y_i$, i.e. $x_i$ is correctly classified
- $\operatorname{sgn}(\langle w, x \rangle + b) = \operatorname{sgn}(c\langle w, x \rangle + cb)$ for any $c > 0$, i.e. the scaling of $(w, b)$ is arbitrary
- for a separating $(w, b)$ with $\tilde\gamma_i > 0$, assume a scaling such that $\min_i \tilde\gamma_i = 1$; such a scaled $(w, b)$ is called the canonical form of the hyperplane
Geometrical margin of a single example: $\gamma_i = \tilde\gamma_i / \|w\|$
- $\gamma_i$ is the distance of $x_i$ to the hyperplane
5 / 30

6 Support Vector Machine (4): separable case (Schölkopf/Smola 1.4; Rasmussen/Williams 6.4)
Geometrical margin of the dataset: $\gamma = \min_i \gamma_i = \min_i \tilde\gamma_i / \|w\|$
- for a canonical separating hyperplane the geometrical margin is $1/\|w\|$
Find a canonical separating hyperplane with maximal margin:
minimize $\tfrac{1}{2}\|w\|^2$ over $(w, b)$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1$ for all $i$
- replacing 1 with 0 does not work: consider scaling down $(w, b)$
- 1 could be replaced by any strictly positive number; it fixes the scaling of $(w, b)$
- a constrained optimization problem: can apply convex optimization, quadratic programming
6 / 30

7 Support Vector Machine (5): separable case (Schölkopf/Smola 1.4)
Constrained optimization problem:
minimize $\tfrac{1}{2}\|w\|^2$ over $(w, b)$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1$ for all $i$
The Lagrangian:
$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \big(y_i(\langle w, x_i \rangle + b) - 1\big)$
- primal variables $(w, b)$, dual variables $\alpha \ge 0$ (aka Lagrange multipliers)
- minimize in $(w, b)$ and maximize in $\alpha$ (saddle point)
7 / 30

8 Support Vector Machine (6): separable case (Schölkopf/Smola 1.4)
The Lagrangian:
$L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^N \alpha_i \big(y_i(\langle w, x_i \rangle + b) - 1\big)$
For the saddle point:
$\partial_w L(w, b, \alpha) = w - \sum_{i=1}^N \alpha_i y_i x_i = 0, \qquad \partial_b L(w, b, \alpha) = -\sum_{i=1}^N \alpha_i y_i = 0$
Dual problem:
maximize $\sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to $\alpha \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
8 / 30

9 Support Vector Machine (7): separable case (Schölkopf/Smola 1.4)
SVM algorithm: given training data $(x_1, y_1), \dots, (x_N, y_N)$, solve the dual problem to obtain $\alpha$ (a toy solver sketch follows below).
Decision function:
$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) = \operatorname{sgn}\Big(\sum_{i=1}^N \alpha_i y_i \langle x, x_i \rangle + b\Big)$
KKT condition: $\alpha_i \big(y_i(\langle w, x_i \rangle + b) - 1\big) = 0$ for all $i$
- for $y_i(\langle w, x_i \rangle + b) - 1 > 0$ we have $\alpha_i = 0$
- only for $y_i(\langle w, x_i \rangle + b) - 1 = 0$ can we have $\alpha_i > 0$; these examples are the support vectors
9 / 30
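The slide leaves the dual solver abstract. Purely as an illustration, the sketch below feeds the hard-margin dual into scipy's general-purpose constrained optimizer on a tiny separable toy set; a real SVM would use a dedicated QP solver, and the toy data and the 1e-6 support-vector threshold are my own choices.

```python
# Sketch: solve the hard-margin dual with scipy, then recover w, b and f(x).
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)
K = X @ X.T                                   # Gram matrix <x_i, x_j>

def neg_dual(a):                              # scipy minimizes, the dual maximizes
    return 0.5 * (a * y) @ K @ (a * y) - a.sum()

res = minimize(neg_dual, np.zeros(N),
               bounds=[(0.0, None)] * N,                            # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                             # support vectors via KKT
b = np.mean(y[sv] - X[sv] @ w)                # from y_i(<w, x_i> + b) = 1
f = lambda x: np.sign(x @ w + b)              # decision function
print(f(X), alpha.round(3))
```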

10 Support Vector Machine (8): non-separable case (Schölkopf/Smola 1.5)
Separable case:
minimize $\tfrac{1}{2}\|w\|^2$ over $(w, b)$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1$ for all $i$
Non-separable case:
minimize $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$ over $(w, b, \xi)$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
- relax the problem by introducing slack variables $\xi \ge 0$
- new hyperparameter $C$ that has to be tuned (e.g. by cross validation)
10 / 30
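As a practical illustration of tuning $C$ by cross validation, the sketch below uses scikit-learn (not part of the lecture); the synthetic data and the candidate grid for $C$ are arbitrary choices.

```python
# Sketch: pick the soft-margin hyperparameter C by 5-fold cross validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C = {C:7.2f}   mean CV accuracy = {scores.mean():.3f}")
```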

11 Support Vector Machine (9): non-separable case (Schölkopf/Smola 1.5)
Non-separable case:
minimize $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$ over $(w, b, \xi)$ subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$
Dual problem:
maximize $\sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to $C \ge \alpha \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
- the only difference to the separable case: the upper bound $C$ on the $\alpha_i$
- this limits the influence of a single example
11 / 30

12 Support Vector Machine (10): non-linear case (Schölkopf/Smola 1.5)
Linear problem:
maximize $\sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to $C \ge \alpha \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
Non-linear problem:
maximize $\sum_{i=1}^N \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to $C \ge \alpha \ge 0$ and $\sum_{i=1}^N \alpha_i y_i = 0$
- replace the inner product $\langle x_i, x_j \rangle$ with a kernel function $k(x_i, x_j)$
- aka covariance function, aka positive definite function
12 / 30
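The substitution is mechanical: the dual touches the data only through pairwise inner products, so the Gram matrix of $\langle x_i, x_j \rangle$ is simply replaced by a kernel matrix. A small sketch with a few example kernels (my own helper, not lecture code):

```python
# Sketch: the dual only needs K[i, j] = k(x_i, x_j), whatever k is.
import numpy as np

def gram(X, kernel):
    """Kernel (Gram) matrix K[i, j] = kernel(x_i, x_j) for rows x_i of X."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

linear = lambda x, z: x @ z                                # linear case
poly2  = lambda x, z: (x @ z) ** 2                         # homogeneous polynomial, p = 2
rbf    = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))  # squared exponential

X = np.random.randn(5, 3)
K = gram(X, poly2)     # plug K into the dual in place of X @ X.T
```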

13 The kernel trick
see also: Schölkopf, Mika, Burges, Knirsch, Müller, Rätsch, Smola, Input Space vs. Feature Space in Kernel-Based Methods, 1999
13 / 30

14 From input space to feature space
Problem 8.1: Given training data, consisting of data points $x_1, \dots, x_n \in X$ and class labels $y_1, \dots, y_n \in Y = \{+1, -1\}$, learn a function $f: X \to Y$ that predicts the label $y_0$ of a new test point $x_0$ most correctly.
Linearly separable: $X$ is called the input space, e.g. $\mathbb{R}^2$; for linearly separable classes, learn a linear function $f(x) = \langle w, x \rangle + b = w^T x + b$.
Not linearly separable: e.g. class $+1$ close to the origin, class $-1$ further away.
- idea: map the data to new features, e.g. their polar coordinates:
$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} \sqrt{x_1^2 + x_2^2} \\ \arctan(x_2/x_1) \end{bmatrix}$
- the polar coordinates are an example of a feature space in which the classes are linearly separable (see the sketch below)
14 / 30
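A small numerical illustration of this idea with synthetic concentric classes (my own construction, using arctan2 to avoid the division by $x_1$): in polar coordinates a simple radius threshold separates the classes.

```python
# Sketch: two concentric classes become linearly separable in polar coordinates.
import numpy as np

rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
t = rng.uniform(-np.pi, np.pi, 200)
X = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)      # input space R^2
y = np.concatenate([np.ones(100), -np.ones(100)])

def polar(x):          # the feature map from the slide (arctan2 for robustness)
    return np.array([np.hypot(x[0], x[1]), np.arctan2(x[1], x[0])])

Phi = np.apply_along_axis(polar, 1, X)
print(np.all(np.sign(1.5 - Phi[:, 0]) == y))   # True: radius < 1.5 separates
```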

15 Feature maps and inner products
Feature map: input space → feature space
$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \end{bmatrix}$
The dot product is essential to compare data points, e.g. the distance:
$\|x - x'\|^2 = \langle x, x \rangle + \langle x', x' \rangle - 2\langle x, x' \rangle$
Dot product in feature space:
$\langle \Phi(x), \Phi(x') \rangle = \Phi(x)^T \Phi(x') = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' = (x^T x')^2 = k(x, x')$
15 / 30
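The identity $\langle \Phi(x), \Phi(x') \rangle = (x^T x')^2$ is easy to check numerically; a quick sketch:

```python
# Sketch: verify <Phi(x), Phi(x')> = (x^T x')^2 for Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))    # 1.0, computed in feature space
print((x @ xp) ** 2)       # 1.0, computed in input space without the map
```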

16 Feature maps and kernel functions
Dot product in feature space:
$\langle \Phi(x), \Phi(x') \rangle = \Phi(x)^T \Phi(x') = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2' = (x^T x')^2 = k(x, x')$
- the dot product in feature space is a non-linear function $k$ in input space, called a kernel function
- it can be calculated without explicitly mapping to the feature space via $\Phi$
- $\Phi$ induces a kernel function
Question: Can we avoid defining $\Phi$ and directly specify $k$?
16 / 30

17 Kernel functions (1)
Answer: Yes! E.g. $k(x, x') = (x^T x')^2$ can be generalized:
$k(x, x') = (x^T x')^p$, called the homogeneous polynomial kernel
One can show: $k(x, x') = \Phi(x)^T \Phi(x')$ with a feature map $\Phi: \mathbb{R}^M \to \mathbb{R}^D$ that maps $x$ to all the monomials of degree $p$ (with suitable coefficients); note that $D$ can be very large
- the kernel function calculates the inner product in $\mathbb{R}^D$ without calculating all monomials, which might be computationally prohibitive
17 / 30
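A quick numerical check of the homogeneous polynomial kernel against the explicit monomial map, here for $M = 2$, $p = 3$; the $\sqrt{\binom{p}{k}}$ weights follow from the binomial theorem.

```python
# Sketch: (x^T x')^p equals the inner product of weighted degree-p monomials.
import numpy as np
from math import comb

p = 3
def phi(x):   # monomials x1^(p-k) x2^k, weighted by sqrt(binomial coefficient)
    return np.array([np.sqrt(comb(p, k)) * x[0] ** (p - k) * x[1] ** k
                     for k in range(p + 1)])

x, xp = np.array([0.5, 2.0]), np.array([-1.0, 0.3])
print(phi(x) @ phi(xp))    # explicit feature space: needs all D monomials
print((x @ xp) ** p)       # kernel: one inner product, same number
```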

18 Kernel functions (2)
All kernel functions introduced for GP regression can be used! They all factorize, because they are positive definite:
$k(x, x') = \Phi(x)^T \Phi(x')$
Analogously to positive definite matrices: $A = (V \Lambda^{1/2})(\Lambda^{1/2} V^T)$
For each kernel function $k$ there exists a mapping $\Phi$ into some possibly infinite-dimensional feature space.
18 / 30

19 The famous kernel trick
Goal: create a non-linear version of an existing linear algorithm
Requirement: the linear algorithm must only calculate dot products of the data points
Kernelization: replace all dot products by a kernel function
Examples: kernel PCA, kernel LDA, kernel CCA, kernel FDA, kernel ICA, ...
19 / 30

20 Kernel PCA
first introduced in: Schölkopf, Smola, Müller, Nonlinear component analysis as a kernel eigenvalue problem, 1998
more details in: Schölkopf, Smola, Learning with Kernels, 2002
20 / 30

21 PCA with inner products?
PCA: given a data matrix $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, find the direction $v \in \mathbb{R}^d$ of largest variance $\lambda$
- calculate the covariance matrix $XX^T$ (up to a constant; assume zero mean)
- $\lambda$ and $v$ are the largest eigenvalue and the corresponding eigenvector of $XX^T$
Problem: the covariance matrix requires outer products $x_i x_j^T$
Challenge: but not inner products! How can we formulate PCA solely using inner products?
21 / 30

22 Singular value decomposition (SVD) (1)
$X = USV^T$
- $U$ is $d \times d$ and $V$ is $n \times n$, so $S$ is rectangular of size $d \times n$
- $U$ and $V$ are unitary, i.e. $UU^T = I_d$ and $VV^T = I_n$, with $I_d$ and $I_n$ being the $d$- and $n$-dimensional identity matrices
- $S$ is a diagonal matrix; its entries are called singular values (SVs)
22 / 30
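A minimal numpy illustration of these properties (the shapes are arbitrary):

```python
# Sketch: X = U S V^T with unitary U (d x d) and V (n x n), rectangular S (d x n).
import numpy as np

d, n = 4, 6
X = np.random.randn(d, n)
U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: d x d, Vt: n x n
S = np.zeros((d, n)); np.fill_diagonal(S, s)      # rectangular diagonal S

print(np.allclose(X, U @ S @ Vt))                 # True: X = U S V^T
print(np.allclose(U @ U.T, np.eye(d)))            # True: U U^T = I_d
print(np.allclose(Vt @ Vt.T, np.eye(n)))          # True: V V^T = I_n
```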

23 Singular value decomposition: graphically
[Figure: block diagrams of the factorization $X = USV^T$ in case (i) ($d \ge n$) and case (ii) ($d < n$).]
23 / 30

24 Singular value decomposition: range and null space
[Figure: the same block diagrams with $U$ partitioned into $(U_1, U_2)$ and $V$ into $(V_1, V_2)$, in both cases.]
The null space of $X$ is spanned by $V_2$; the range of $X$ is spanned by $U_1$.
24 / 30

25 Singular value decomposition: economy size and rank
[Figure: block diagrams of the economy-size SVD $X = U_1 S V_1^T$ with inner dimension $k$, in both cases.]
The rank of $X$ is $k$, the number of non-zero singular values.
25 / 30

26 From SVD to eigenvalue decomposition (1)
Eigenvalue decomposition of a square matrix $A$: $A = V \Lambda V^T$
- with eigenvalues along the diagonal of $\Lambda$ and eigenvectors as columns of $V$
- compact notation for $Av = \lambda v$, for all eigenpairs simultaneously
(Economy-size) SVD of the data matrix $X$: $X = USV^T$. Calculate $XX^T$ and $X^TX$:
$XX^T = USV^TVSU^T = US^2U^T = U\Lambda U^T$
$X^TX = VSU^TUSV^T = VS^2V^T = V\Lambda V^T$
- the left singular vectors $U$ of $X$ are the eigenvectors of $XX^T$
- the right singular vectors $V$ of $X$ are the eigenvectors of $X^TX$
- the squared SVs of $X$ are the eigenvalues of $XX^T$ and of $X^TX$
26 / 30
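These identities are easy to verify numerically (a small sketch of my own, not lecture code):

```python
# Sketch: eigendecompositions of X X^T and X^T X recover the SVD pieces.
import numpy as np

X = np.random.randn(4, 6)
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # economy-size SVD

lam, Ue = np.linalg.eigh(X @ X.T)                  # eigh returns ascending order
print(np.allclose(lam[::-1], s ** 2))              # eigenvalues = squared SVs

lam2, Ve = np.linalg.eigh(X.T @ X)                 # 6 eigenvalues, 2 of them ~ 0
print(np.allclose(lam2[::-1][:4], s ** 2))         # nonzero ones also equal s^2
# The columns of Ue and Ve match U and V up to ordering and sign flips.
```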

27 From SVD to eigenvalue decomposition (2)
Lemma 8.2: We have the following formulas to calculate the left singular vectors from the right ones and vice versa:
$U = XV\Lambda^{-1/2}, \qquad V = X^TU\Lambda^{-1/2}$
Note: if the signs of $\Lambda^{1/2}$ do not match $S$, some of the vectors in $U$ change their orientation. However, that is no problem.
A similar result holds for the submatrices:
$U_1 = XV_1\Lambda_1^{-1/2}, \qquad V_1 = X^TU_1\Lambda_1^{-1/2}$
where $U_1$ and $V_1$ contain the columns of $U$ and $V$ that correspond to large eigenvalues in $\Lambda$.
Proof: Plug the SVD of $X$ into the formulas.
27 / 30

28 PCA based on the Gram matrix
Algorithm:
- calculate the eigendecomposition of the Gram matrix $X^TX = V\Lambda V^T$
- note: $\Lambda$ also holds the eigenvalues of the covariance matrix $XX^T$
- from above we get a formula for $U$ such that $XX^T = U\Lambda U^T$: $U = XV\Lambda^{-1/2}$, and similarly $U_1$ corresponding to the large eigenvalues
- project the data points onto the space spanned by $U_1$:
$Y = U_1^TX = \Lambda_1^{-1/2}V_1^TX^TX$
Note: we never calculate $XX^T$; we only calculate $X^TX$, $V_1$, $\Lambda_1$ and $Y$.
28 / 30
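A minimal sketch of this Gram-matrix PCA, checked against ordinary covariance-based PCA (dimensions and data are arbitrary):

```python
# Sketch: PCA projection computed from X^T X only, per the slide's algorithm.
import numpy as np

d, n, k = 5, 8, 2
X = np.random.randn(d, n)
X -= X.mean(axis=1, keepdims=True)            # assume zero mean here

G = X.T @ X                                   # Gram matrix (inner products only)
lam, V = np.linalg.eigh(G)                    # ascending eigenvalues
V1, lam1 = V[:, ::-1][:, :k], lam[::-1][:k]   # top-k eigenpairs
Y = (lam1 ** -0.5)[:, None] * (V1.T @ G)      # Y = Lambda_1^{-1/2} V_1^T X^T X

U1 = np.linalg.eigh(X @ X.T)[1][:, ::-1][:, :k]   # reference via covariance
print(np.allclose(np.abs(Y), np.abs(U1.T @ X)))   # True (up to sign flips)
```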

29 PCA based on the Gram matrix with non-zero mean
Mean of the data: $\mu = \frac{1}{n}X\mathbf{1}$, where $\mathbf{1}$ is the $n$-dimensional one-vector.
Remove the mean:
$X - \mu\mathbf{1}^T = X - \tfrac{1}{n}X\mathbf{1}\mathbf{1}^T = X\big(I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T\big) = XH$
where we define the Helmert matrix $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$.
- note that $H = H^T$ and $H = HH$ (idempotent)
Gram matrix for non-zero mean: $G = (XH)^TXH = HX^TXH$
- the rest remains the same!
29 / 30

30 Kernel PCA
Steps:
(1) replace the Gram matrix $X^TX$ with a kernel matrix $K$
(2) center $K$ by $H$ and find its eigenvalue decomposition $HKH = V\Lambda V^T$
(3) project the centered data onto the eigenvectors (as above): $Y = \Lambda_1^{-1/2}V_1^THKH$
That's it (for today)!
30 / 30
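Putting the three steps together, a minimal kernel PCA sketch; the squared-exponential kernel here is an arbitrary choice of $k$ on my part.

```python
# Sketch of kernel PCA following steps (1)-(3) on the slide.
import numpy as np

def kernel_pca(X, kernel, k=2):
    """X: d x n data matrix; returns the k x n projected coordinates Y."""
    n = X.shape[1]
    # (1) kernel matrix replaces the Gram matrix X^T X
    K = np.array([[kernel(X[:, i], X[:, j]) for j in range(n)] for i in range(n)])
    # (2) center with H = I - (1/n) 1 1^T and eigendecompose HKH
    H = np.eye(n) - np.ones((n, n)) / n
    HKH = H @ K @ H
    lam, V = np.linalg.eigh(HKH)
    lam1, V1 = lam[::-1][:k], V[:, ::-1][:, :k]        # top-k eigenpairs
    # (3) project: Y = Lambda_1^{-1/2} V_1^T H K H
    return (lam1 ** -0.5)[:, None] * (V1.T @ HKH)

rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
Y = kernel_pca(np.random.randn(3, 10), rbf, k=2)       # 2 x 10 embedding
```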
