Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen
Linear Classifiers III
Blaine Nelson, Tobias Scheffer
Contents
- Classification Problem
- Bayesian Classifier, Decision
- Linear Classifiers, MAP Models
- Logistic Regression
- Regularized Empirical Risk Minimization
- Kernel Perceptron, Support Vector Machine
- Ridge Regression, LASSO
- Representer Theorem
- Dualized Perceptron, Dual SVM
- Mercer Map
- Learning with Structured Input & Output
- Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm
Review: Linear Models
Linear classifiers:
- Binary classifier: f_θ(x) = φ(x)ᵀθ + b
- Multiclass classifier: f_θ(x, y) = φ(x)ᵀθ_y + b_y
Many learning methods minimize the sum of loss functions over the training data plus a regularizer:
argmin_θ Σ_{i=1}^n ℓ(f_θ(x_i), y_i) + c·Ω(θ)
The choice of loss & regularizer yields different methods: logistic regression, perceptron, SVM.
Review: Feature Mappings
All considered linear methods can be made nonlinear by means of a feature mapping φ. Better separations can be obtained in feature space, e.g.
φ(x_1, x_2) = (x_1 x_2, x_1², x_2²)
A hyperplane in feature space corresponds to a nonlinear surface in the original space.
Dual Form Linear Model: Motivation
The feature mapping φ(x) can be high dimensional:
- The size of the estimated parameter vector θ depends on the dimensionality of φ — it could be infinite!
Computation of φ(x) is expensive:
- φ must be computed for each training point x_i and for each prediction x.
- This incurs high computational & memory requirements.
How can we adapt linear methods to efficiently incorporate a high-dimensional φ?
Dual Form Linear Model
Representer Theorem: If g is strictly monotonically increasing, then the θ that minimizes
L(θ) = Σ_{i=1}^n ℓ(f_θ(x_i), y_i) + g(‖θ‖²)
has the form θ = Σ_{i=1}^n α_i φ(x_i), with α_i ∈ ℝ. Hence
f_θ(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x)
The inner product is a measure of similarity between samples.
Generally θ could be any vector in the feature space, but we show it must lie in the span of the data.
Representer Theorem: Proof
L(θ) = Σ_{i=1}^n ℓ(f_θ(x_i), y_i) + g(‖θ‖²)
Orthogonal decomposition: θ = θ∥ + θ⊥, with
θ∥ ∈ Θ∥ = { Σ_{i=1}^n α_i φ(x_i) | α_i ∈ ℝ } and θ⊥ ∈ Θ⊥ = { θ̃ ∈ Θ | θ̃ᵀθ∥ = 0 for all θ∥ ∈ Θ∥ }
For any training point x_i it follows that
f_θ(x_i) = θ∥ᵀφ(x_i) + θ⊥ᵀφ(x_i) = θ∥ᵀφ(x_i)
since θ⊥ is, by construction, orthogonal to every φ(x_i). Thus ℓ(f_θ(x_i), y_i) is independent of θ⊥.
Finally,
g(‖θ‖²) = g(‖θ∥ + θ⊥‖²) = g(‖θ∥‖² + ‖θ⊥‖²) ≥ g(‖θ∥‖²)
since θ∥ᵀθ⊥ = 0 (Pythagorean theorem) and g is strictly monotonically increasing.
It follows that θ⊥ = 0 at the minimum.
Representer Theorem
Given training data T = {(x_1, y_1), …, (x_n, y_n)} and a feature mapping φ(x), we construct a linear function f_θ(x) = θᵀφ(x); i.e., we find a hyperplane θ.
The hyperplane θ that minimizes L(θ) = Σ_{i=1}^n ℓ(f_θ(x_i), y_i) + g(‖θ‖²) can be represented as
f_θ(x) = θᵀφ(x) = f_α(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x)
Primal view: f_θ(x) = θᵀφ(x) — the hypothesis θ has as many parameters as the dimensionality of φ(x).
Dual view: f_α(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x) — the hypothesis has as many parameters α_i as samples.
Representer Theorem
Primal view: f_θ(x) = θᵀφ(x) — the hypothesis θ has as many parameters as the dimensionality of φ(x). Good if there are many samples with few attributes.
Dual view: f_α(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x) — the hypothesis has as many parameters α_i as samples. Good if there are few samples with high dimensionality.
The representation φ(x) can even be infinite dimensional, as long as the inner product can be efficiently computed: i.e., by a kernel function.
Dual Form of a Linear Model
A parameter vector θ that minimizes a regularized loss function is always a linear combination of the training samples:
θ = Σ_{i=1}^n α_i φ(x_i)
The dual form α has as many parameters α_i as there are training samples. Dual decision function: f_α(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x)
The primal form θ has as many parameters θ_i as the dimensionality of the feature mapping φ(x). Primal decision function: f_θ(x) = θᵀφ(x)
The dual form is advantageous if there are few samples and many attributes.
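The primal/dual equivalence above is easy to verify numerically. The following is a minimal NumPy sketch (our own toy data, with φ taken as the identity map): given dual coefficients α, the primal weight vector θ = Σ_i α_i φ(x_i) produces exactly the same prediction as the dual decision function.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 training samples, 3 attributes; phi(x) = x
alpha = rng.normal(size=5)    # dual coefficients, one per training sample

# primal parameters: theta = sum_i alpha_i * phi(x_i)
theta = X.T @ alpha

x_new = rng.normal(size=3)
f_primal = x_new @ theta            # f_theta(x) = theta^T phi(x)
f_dual = alpha @ (X @ x_new)        # f_alpha(x) = sum_i alpha_i <phi(x_i), phi(x)>

# both views of the same hypothesis agree
assert np.isclose(f_primal, f_dual)
```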
DUAL PERCEPTRON
Dualized Perceptron
Perceptron classification: f_θ(x_i) = θᵀφ(x_i); in dual form f_α(x_i) = Σ_{j=1}^n α_j φ(x_j)ᵀφ(x_i).
The perceptron algorithm halts when the following holds for all samples:
y_i f_θ(x_i) > 0  ⇔  y_i Σ_{j=1}^n α_j φ(x_j)ᵀφ(x_i) > 0
⇒ the sample lies on the correct side of the hyperplane.
Primal update step: θ = θ + y_i φ(x_i). In dual form:
Σ_{j=1}^n α_j^new φ(x_j) = Σ_{j=1}^n α_j^old φ(x_j) + y_i φ(x_i)
α_i^new φ(x_i) = α_i^old φ(x_i) + y_i φ(x_i)
α_i^new = α_i^old + y_i
Dualized Perceptron Algorithm
Perceptron(Instances (x_i, y_i)):
  Set α = 0
  DO
    FOR i = 1, …, n
      IF y_i f_α(x_i) ≤ 0
      THEN α_i = α_i + y_i
    END
  WHILE α changes
  RETURN α
Decision function: f_α(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x)
Dualized Perceptron
Perceptron loss, no regularizer.
Dual form of the decision function: f_α(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x)
Dual form of the update rule: if y_i f_α(x_i) ≤ 0, then α_i = α_i + y_i
Equivalent to the primal form of the perceptron.
Advantageous to use instead of the primal perceptron if there are few samples and φ(x) is high dimensional.
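The dual update rule translates directly into code. Below is a minimal NumPy sketch (the function names `kernel_perceptron` and `rbf` are our own, not from the lecture): one coefficient α_i per sample, updated by α_i += y_i on every mistake. With an RBF kernel it separates XOR-style data that no primal linear perceptron could.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual perceptron: one coefficient alpha_i per training sample."""
    n = len(y)
    K = kernel(X, X)                      # precompute Gram matrix K_ij = k(x_i, x_j)
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        changed = False
        for i in range(n):
            # f_alpha(x_i) = sum_j alpha_j k(x_j, x_i)
            if y[i] * (alpha @ K[:, i]) <= 0:
                alpha[i] += y[i]          # dual update rule
                changed = True
        if not changed:                   # all samples on the correct side
            break
    return alpha

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# XOR-style data: not linearly separable in input space,
# but separable in the (implicit) RBF feature space
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, rbf)
preds = np.sign(alpha @ rbf(X, X))   # equals y after convergence
```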
DUAL SUPPORT VECTOR MACHINE
Dualized Support Vector Machine
Primal: min_θ Σ_{i=1}^n max(0, 1 − y_i φ(x_i)ᵀθ) + (1/2λ) θᵀθ
Equivalent optimization problem with side constraints:
min_{θ,ξ} λ Σ_{i=1}^n ξ_i + ½ θᵀθ
such that y_i φ(x_i)ᵀθ ≥ 1 − ξ_i and ξ_i ≥ 0
Goal: dual formulation of the optimization problem.
Dualized Support Vector Machine
Optimization problem with side constraints:
min_{θ,ξ} λ Σ_{i=1}^n ξ_i + ½ θᵀθ such that y_i φ(x_i)ᵀθ ≥ 1 − ξ_i and ξ_i ≥ 0
Lagrange function with Lagrange multipliers β ≥ 0 and β⁰ ≥ 0 for the side constraints:
L(θ, ξ, β, β⁰) = λ Σ_{i=1}^n ξ_i + ½ θᵀθ − Σ_{i=1}^n β_i (y_i φ(x_i)ᵀθ − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i
(Goal function: Z(θ, ξ). Side constraints: g(θ, ξ) ≥ 0. Lagrange function: Z(θ, ξ) − βg(θ, ξ).)
Optimization problem without side constraints: min_{θ,ξ} max_{β,β⁰ ≥ 0} L(θ, ξ, β, β⁰)
Dualized Support Vector Machine
Lagrange function:
L(θ, ξ, β, β⁰) = λ Σ_{i=1}^n ξ_i + ½ θᵀθ − Σ_{i=1}^n β_i (y_i φ(x_i)ᵀθ − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i
Since it is convex in θ, ξ, the strong duality theorem gives:
min_{θ,ξ} max_{β,β⁰ ≥ 0} L(θ, ξ, β, β⁰) = max_{β,β⁰ ≥ 0} min_{θ,ξ} L(θ, ξ, β, β⁰)
Minimum: set the derivatives of L w.r.t. θ and ξ to zero:
∂L/∂θ = 0  ⇒  θ = Σ_{i=1}^n β_i y_i φ(x_i)
∂L/∂ξ_i = 0  ⇒  λ = β_i + β_i⁰
The first relation between primal and dual parameters is exactly the form θ = Σ_i α_i φ(x_i) of the Representer Theorem.
Dualized Support Vector Machine
Substitute the derived parameters θ = Σ_{i=1}^n β_i y_i φ(x_i) and λ = β_i + β_i⁰ into the Lagrange function:
L(θ, ξ, β, β⁰)
= ½ (Σ_{i=1}^n β_i y_i φ(x_i))ᵀ(Σ_{j=1}^n β_j y_j φ(x_j)) − Σ_{i=1}^n β_i (y_i φ(x_i)ᵀ Σ_{j=1}^n β_j y_j φ(x_j) − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i + λ Σ_{i=1}^n ξ_i
= ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j) − Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j) + Σ_{i=1}^n β_i − Σ_{i=1}^n (β_i + β_i⁰) ξ_i + λ Σ_{i=1}^n ξ_i
(the last two terms cancel, since β_i + β_i⁰ = λ)
= Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j)
Dualized Support Vector Machine
Substituting the derived parameters into the Lagrange function:
L(β) = Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j)
Since β⁰ ≥ 0 and λ = β_i + β_i⁰, it follows: 0 ≤ β_i ≤ λ.
Optimization criterion of the dual SVM:
max_β Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j)
such that 0 ≤ β_i ≤ λ
(The linear term acts like an L1-regularizer of β — sparse; the quadratic term is large if β_i, β_j > 0 for similar samples of different classes.)
Dualized Support Vector Machine
λ = β_i + β_i⁰
Optimization criterion of the dual SVM:
max_β Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j) such that 0 ≤ β_i ≤ λ
The multipliers correspond to the constraints: β_i : y_i φ(x_i)ᵀθ ≥ 1 − ξ_i and β_i⁰ : ξ_i ≥ 0.
A Lagrange multiplier is greater than 0 exactly when its corresponding constraint is fulfilled with equality.
- β_i = 0, β_i⁰ = λ: we have y_i φ(x_i)ᵀθ > 1 − ξ_i & ξ_i = 0. (Distance to the hyperplane exceeds the margin.)
- β_i = λ, β_i⁰ = 0: we have y_i φ(x_i)ᵀθ = 1 − ξ_i & ξ_i > 0. (Sample violates the margin.)
- 0 < β_i < λ: we have y_i φ(x_i)ᵀθ = 1 − ξ_i & ξ_i = 0. (Sample lies on the margin.)
Dualized Support Vector Machine
Optimization criterion of the dual SVM:
max_β Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j) such that 0 ≤ β_i ≤ λ
- Optimization over n parameters β.
- Solution found with a QP solver in O(n²).
- Sparse solution.
- Samples only appear as pairwise inner products.
Dualized Support Vector Machine
Primal and dual optimization problems have the same solution:
θ = Σ_{x_i ∈ SV} β_i y_i φ(x_i)
Support vectors: samples with β_i > 0.
Dual form of the decision function: f_β(x) = Σ_{x_i ∈ SV} β_i y_i φ(x_i)ᵀφ(x)
Primal SVM: the solution is a vector θ in the space of the attributes.
Dual SVM: the same solution is represented as weights β_i of the samples.
Dualized Support Vector Machine
Hinge loss, L2-regularization.
Dual form of the decision function: f_β(x) = Σ_{x_i ∈ SV} β_i y_i φ(x_i)ᵀφ(x)
Dual form of the optimization problem:
max_β Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)ᵀφ(x_j) such that 0 ≤ β_i ≤ λ
Primal and dual optimization problems have identical solutions but different forms.
The dual is advantageous if there are few samples and φ(x) is high dimensional.
Kernel Support Vector Machine
Optimization criterion of the kernel SVM:
max_β Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j k(x_i, x_j) such that 0 ≤ β_i ≤ λ
where k is the inner product function.
Decision function: f_β(x) = Σ_{x_i ∈ SV} β_i y_i k(x_i, x)
Samples only interact through the kernel function k(x_i, x_j). The feature mapping φ no longer appears in the optimization problem or decision function.
KERNELS
Kernels and Kernel Methods
The feature mapping φ(x) can be high dimensional:
- the number of estimated parameters θ depends on φ;
- computation of φ(x) is expensive.
Previously: given φ(x), the inner product φ(x)ᵀφ(x′) measures the similarity between samples. Many methods can be formulated so that samples only appear as pairwise inner products.
Idea: replace the inner product with any similarity measure k(x, x′) = φ(x)ᵀφ(x′) and map samples only implicitly.
For which functions k does there exist a mapping φ(x) such that k represents an inner product?
Kernel Functions: Motivation
Can we simply choose k(x, x′) to be any function? We need k to be an inner product in some feature space — else we lose meaning & convexity.
Optimization criterion of the kernel SVM:
max_{0 ≤ β ≤ λ} Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j k(x_i, x_j) = max_{0 ≤ β ≤ λ} 1ᵀβ − ½ (y ∘ β)ᵀ K (y ∘ β), with K_ij = k(x_i, x_j)
This optimization is convex (with a unique solution) if K is positive semi-definite (non-negative eigenvalues).
Recap: Positive Definiteness
A matrix K is called positive semi-definite (PSD) if xᵀKx ≥ 0 holds for all x. It is called positive definite if equality holds only at x = 0.
A function k is called positive semi-definite (PSD) if ∫∫ z(x) k(x, x′) z(x′) dx dx′ ≥ 0 holds for all continuous functions z.
Recap: Positive Definiteness
A matrix K is called positive semi-definite if xᵀKx ≥ 0 for all x.
Example: a covariance matrix Σ in a Gaussian density:
N(x; μ, Σ) = (2π)^{−m/2} |Σ|^{−1/2} exp(−½ (x−μ)ᵀΣ⁻¹(x−μ))
Positive definite matrices are invertible, and the inverse is also positive definite.
Positive definiteness implies a norm: ‖x‖ = √(xᵀΣ⁻¹x)
Mahalanobis distance: d(x, x′) = √((x − x′)ᵀΣ⁻¹(x − x′))
[Figure: contour plot of a 2-D Gaussian density]
Kernels
Theorem: For every positive definite function k there exists a mapping φ(x) such that k(x, x′) = φ(x)ᵀφ(x′) for all x and x′.
This mapping is not unique. For example, consider φ₁(x) = x and φ₂(x) = −x:
φ₁(x)ᵀφ₁(x′) = xᵀx′ = (−x)ᵀ(−x′) = φ₂(x)ᵀφ₂(x′)
Gram matrix or kernel matrix K, with K_ij = k(x_i, x_j): the matrix of inner products = pairwise similarities between samples; an n×n matrix.
k(x, x′) is PSD iff K is a PSD matrix for every dataset.
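The Gram matrix condition is easy to check empirically: for a valid kernel, the eigenvalues of K must all be non-negative for any dataset. A minimal NumPy sketch (random data and the RBF kernel chosen by us as the example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))   # 20 samples, 4 attributes

# RBF Gram matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)

# a PSD Gram matrix has only non-negative eigenvalues
eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
assert eigvals.min() > -1e-10     # allow tiny negative round-off
```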
Kernels
Theorem: For every positive definite function k there exists a mapping φ(x) such that k(x, x′) = φ(x)ᵀφ(x′) for all x and x′.
Constructive proofs:
- Reproducing Kernel Hilbert Space (RKHS). Idea: define the mapping as a function, φ(x) = k(x, ·); define an inner product ⟨·,·⟩ between functions; show k(x, x′) = ⟨k(x, ·), k(x′, ·)⟩.
- Mercer mapping. Idea: decomposition of k in terms of its eigenfunctions. Practically relevant: the finite case.
MERCER MAP
Mercer Map
Eigenvalue decomposition: every symmetric matrix K can be decomposed in terms of its eigenvectors u_i and eigenvalues λ_i:
K = UΛU⁻¹, with Λ = diag(λ_1, …, λ_n) and U = (u_1 ⋯ u_n)
If K is positive semi-definite, then λ_i ∈ ℝ≥0.
The eigenvectors are orthonormal (u_iᵀu_i = 1 and u_iᵀu_j = 0 for i ≠ j) and U is orthogonal: Uᵀ = U⁻¹.
Mercer Map
Thus, from the eigenvalue decomposition it holds:
K = UΛUᵀ = UΛ^{1/2} Λ^{1/2} Uᵀ = (UΛ^{1/2})(UΛ^{1/2})ᵀ
where Λ^{1/2} is the diagonal matrix with entries √λ_i.
A feature mapping for the training data can then be defined as
(φ(x_1) ⋯ φ(x_n)) = (UΛ^{1/2})ᵀ
Mercer Map
Feature mapping for the training data: Φ(X_train) = (φ(x_1) ⋯ φ(x_n)) = (UΛ^{1/2})ᵀ
Kernel matrix between training and test data:
K_test = Φ(X_train)ᵀ Φ(X_test) = UΛ^{1/2} Φ(X_test)
Solving this equation yields a mapping of the test data:
Φ(X_test) = (UΛ^{1/2})⁻¹ K_test = Λ^{−1/2} Uᵀ K_test   (since Uᵀ = U⁻¹)
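The Mercer map can be sketched directly from the equations above (random toy data and the `rbf` helper are our own; note that NumPy's `eigh` returns eigenvectors as columns of U): explicit feature vectors for training and test points are recovered from the kernel matrices alone, and their inner products reproduce the kernel values.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.normal(size=(10, 3))
X_test = rng.normal(size=(4, 3))

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X_train, X_train)              # n x n training Gram matrix
lam, U = np.linalg.eigh(K)             # K = U diag(lam) U^T
lam = np.clip(lam, 0.0, None)          # clip tiny negative round-off

# rows of Phi_train are the explicit feature vectors phi(x_i)^T
Phi_train = U @ np.diag(np.sqrt(lam))

# map the test data: Phi_test = K_test^T U Lambda^{-1/2}
K_test = rbf(X_train, X_test)          # n_train x n_test
inv_sqrt = np.zeros_like(lam)
mask = lam > 1e-12
inv_sqrt[mask] = 1.0 / np.sqrt(lam[mask])
Phi_test = K_test.T @ U @ np.diag(inv_sqrt)

# inner products in the explicit feature space reproduce the kernel
assert np.allclose(Phi_train @ Phi_train.T, K, atol=1e-8)
assert np.allclose(Phi_train @ Phi_test.T, K_test, atol=1e-6)
```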
Mercer Map
Useful if a learning problem is given as a kernel function but learning should take place in the primal — for example, if the kernel matrix would be too large (quadratic memory consumption!).
Kernel Functions
- Polynomial kernels: k_poly(x_i, x_j) = (x_iᵀx_j + 1)^p
- Radial basis functions: k_RBF(x_i, x_j) = exp(−γ‖x_i − x_j‖²)
- Sigmoid kernels, string kernels (e.g., for classification of gene sequences).
- Graph kernels for learning with structured instances.
Further literature: B. Schölkopf, A. J. Smola: Learning with Kernels, 2002.
Polynomial Kernels
Kernel function: k_poly(x_i, x_j) = (x_iᵀx_j + 1)^p
Which transformation φ corresponds to this kernel? Example: 2-D input space, p = 2.
Polynomial Kernels
Kernel: k_poly(x_i, x_j) = (x_iᵀx_j + 1)^p, 2-D input, p = 2:
k_poly(x_i, x_j) = (x_iᵀx_j + 1)² = (x_{i1}x_{j1} + x_{i2}x_{j2} + 1)²
= x_{i1}²x_{j1}² + x_{i2}²x_{j2}² + 2x_{i1}x_{j1}x_{i2}x_{j2} + 2x_{i1}x_{j1} + 2x_{i2}x_{j2} + 1
= (x_{i1}², x_{i2}², √2 x_{i1}x_{i2}, √2 x_{i1}, √2 x_{i2}, 1) · (x_{j1}², x_{j2}², √2 x_{j1}x_{j2}, √2 x_{j1}, √2 x_{j2}, 1)ᵀ
= φ(x_i)ᵀφ(x_j)
The feature mapping φ contains all monomials of degree ≤ 2 over the input attributes.
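The derivation above can be checked numerically: the kernel value must equal the inner product under the explicit degree-2 feature map. A minimal NumPy sketch (the sample points are arbitrary):

```python
import numpy as np

def k_poly(xi, xj, p=2):
    """Polynomial kernel (x_i^T x_j + 1)^p."""
    return (xi @ xj + 1.0) ** p

def phi(x):
    """Explicit feature map for the 2-D polynomial kernel with p = 2:
    all monomials of degree <= 2, with sqrt(2) weights on the cross terms."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

xi = np.array([0.5, -1.0])
xj = np.array([2.0, 3.0])

# kernel trick: same value without ever computing phi explicitly
assert np.isclose(k_poly(xi, xj), phi(xi) @ phi(xj))
```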
RBF Kernel
Kernel: k_RBF(x_i, x_j) = exp(−γ‖x_i − x_j‖²)
Which transformation φ corresponds to this kernel?
Kernels
The kernel function k(x, x′) = φ(x)ᵀφ(x′) computes the inner product of the feature mappings of two instances.
The kernel function can often be computed without an explicit representation φ(x). E.g., polynomial kernel: k_poly(x_i, x_j) = (x_iᵀx_j + 1)^p.
Infinite-dimensional feature mappings are possible. E.g., RBF kernel: k_RBF(x_i, x_j) = exp(−γ‖x_i − x_j‖²).
For every positive definite kernel there is a feature mapping φ(x) such that k(x, x′) = φ(x)ᵀφ(x′).
For a given kernel matrix, the Mercer map provides a feature mapping.
Summary
Representer Theorem: f_θ(x) = Σ_{i=1}^n α_i φ(x_i)ᵀφ(x) — samples only interact through inner products.
Kernel Perceptron:
  Perceptron(Instances (x_i, y_i)):
    Set α = 0
    DO FOR i = 1, …, n: IF y_i f_α(x_i) ≤ 0 THEN α_i = α_i + y_i
    WHILE α changes; RETURN α
  Linear model: f_α(x) = Σ_{i=1}^n α_i k(x_i, x)
Kernel SVM:
  max_β Σ_{i=1}^n β_i − ½ Σ_{i,j=1}^n β_i β_j y_i y_j k(x_i, x_j) such that 0 ≤ β_i ≤ λ
  f_β(x) = Σ_{x_i ∈ SV} β_i y_i k(x_i, x)
Kernel functions: positive definite functions k(x, x′) are an inner product for some feature space. Feature mappings are done implicitly.
Merry Christmas & a Happy New Year!
Next time: kernels for structured data & learning for structured outputs.