Linear Classifiers III

Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen
Linear Classifiers III
Blaine Nelson, Tobias Scheffer

Contents: Classification Problem, Bayesian Classifier, Decision, Linear Classifiers, MAP Models, Logistic Regression, Regularized Empirical Risk Minimization, Kernel Perceptron, Support Vector Machine, Ridge Regression, LASSO, Representer Theorem, Dualized Perceptron, Dual SVM, Mercer Map, Learning with Structured Input & Output (Taxonomy, Sequences, Ranking, Decoder, Cutting Plane Algorithm).

Review: Linear Models
Linear classifiers:
Binary classifier: f_θ(x) = φ(x)^T θ + b
Multiclass classifier: f_θ(x, y) = φ(x)^T θ_y + b_y
Many learning methods minimize the sum of loss functions over the training data plus a regularizer:
argmin_θ Σ_{i=1}^n l(f_θ(x_i), y_i) + c·Ω(θ)
The choice of loss and regularizer gives different methods: logistic regression, perceptron, SVM.
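
To make the regularized empirical risk minimization template concrete, here is a minimal numpy sketch of one instance of it, L2-regularized logistic regression trained by gradient descent; the function name, step size, and iteration count are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# A minimal sketch of regularized empirical risk minimization:
#   argmin_theta sum_i l(f_theta(x_i), y_i) + c * Omega(theta)
# with logistic loss and an L2 regularizer (hypothetical helper, illustrative defaults).
def fit_logistic_l2(X, y, c=0.1, lr=0.1, epochs=500):
    """X: (n, d) matrix of feature vectors phi(x_i), y: labels in {-1, +1}."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ theta)                            # y_i * f_theta(x_i)
        # gradient of sum_i log(1 + exp(-margin_i)) + c * ||theta||^2
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) + 2.0 * c * theta
        theta -= lr * grad / n
    return theta
```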

Review: Feature Mappings
All considered linear methods can be made nonlinear by means of a feature mapping φ. Better separations can be obtained in feature space, e.g.
φ(x_1, x_2) = (x_1 x_2, x_1², x_2²)
A hyperplane in feature space corresponds to a nonlinear surface in the original space.
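
As a small illustration of this idea (not from the slides), the quadratic feature map above turns a circular class boundary into a linear one; the generated toy data are an assumption for the sake of the example.

```python
import numpy as np

# The feature map from the slide: phi(x1, x2) = (x1*x2, x1^2, x2^2).
def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 * x2, x1 ** 2, x2 ** 2])

X = np.random.randn(200, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # circular boundary: nonlinear in x
Z = phi(X)                                                # in feature space the boundary
# x1^2 + x2^2 = 1 becomes the hyperplane z_2 + z_3 = 1, i.e. linear in phi(x).
```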

Dual Form Linear Model: Motivation
The feature mapping φ(x) can be high dimensional. The size of the estimated parameter vector θ depends on the dimensionality of φ, which could be infinite!
Computation of φ(x) is expensive: φ must be computed for each training point x_i and for each prediction x. This incurs high computational and memory requirements.
How can we adapt linear methods to efficiently incorporate a high-dimensional φ?

Dual Form Linear Model
Representer Theorem: If g is strictly monotonically increasing, then the θ that minimizes
L(θ) = Σ_{i=1}^n l(f_θ(x_i), y_i) + g(‖θ‖²)
has the form
θ = Σ_{i=1}^n α_i φ(x_i), with α_i ∈ R, and hence f_θ(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x).
The inner product is a measure of similarity between samples. Generally θ could be any vector in the feature space Φ, but we show it must lie in the span of the data.

Representer Theorem: Proof
Minimize L(θ) = Σ_{i=1}^n l(f_θ(x_i), y_i) + g(‖θ‖²).
Orthogonal decomposition: θ = θ_∥ + θ_⊥, with θ_∥ ∈ Θ_∥ = { Σ_{i=1}^n α_i φ(x_i) : α_i ∈ R } and θ_⊥ ∈ Θ_⊥ = { θ_⊥ ∈ Θ : θ_⊥^T θ_∥ = 0 for all θ_∥ ∈ Θ_∥ }.
For any training point x_i it follows that f_θ(x_i) = θ_∥^T φ(x_i) + θ_⊥^T φ(x_i) = θ_∥^T φ(x_i). Why is θ_⊥^T φ(x_i) = 0? Because φ(x_i) ∈ Θ_∥ and θ_⊥ is orthogonal to every element of Θ_∥.
Thus Σ_{i=1}^n l(f_θ(x_i), y_i) is independent of θ_⊥. For the regularizer,
g(‖θ‖²) = g(‖θ_∥ + θ_⊥‖²) = g(‖θ_∥‖² + ‖θ_⊥‖²) ≥ g(‖θ_∥‖²),
since θ_∥^T θ_⊥ = 0 (Pythagorean theorem) and g is strictly monotonically increasing. Hence the minimizer satisfies θ_⊥ = 0.

Representer Theorem
Given training data T = {(x_1, y_1), …, (x_n, y_n)} and a feature mapping φ(x), we construct a linear function f_θ(x) = θ^T φ(x); i.e., we find a hyperplane θ. The hyperplane θ that minimizes
L(θ) = Σ_{i=1}^n l(f_θ(x_i), y_i) + g(‖θ‖²)
can be represented as
f_θ(x) = θ^T φ(x) = f_α(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x).
Primal view: f_θ(x) = θ^T φ(x). The hypothesis θ has as many parameters as the dimensionality of φ(x). Good if there are many samples with few attributes.
Dual view: f_α(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x). The hypothesis has as many parameters α_i as there are samples. Good if there are few samples with high dimensionality. The representation φ(x) can even be infinite dimensional, as long as the inner product can be computed efficiently, i.e., by a kernel function.

Dual Form of a Linear Model
A parameter vector θ that minimizes a regularized loss function is always a linear combination of training samples:
θ = Σ_{i=1}^n α_i φ(x_i)
The dual form α has as many parameters α_i as there are training samples. Dual decision function: f_α(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x).
The primal form θ has as many parameters θ_i as the dimensionality of the feature mapping φ(x). Primal decision function: f_θ(x) = θ^T φ(x).
The dual form is advantageous if there are few samples and many attributes.
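
A quick numerical sanity check (an illustration, not lecture material) that the primal and dual decision functions agree when θ = Σ_i α_i φ(x_i):

```python
import numpy as np

Phi = np.random.randn(5, 3)      # rows are phi(x_i) for n = 5 training points, d = 3
alpha = np.random.randn(5)       # dual parameters, one per training point
theta = Phi.T @ alpha            # representer theorem: theta = sum_i alpha_i phi(x_i)

phi_x = np.random.randn(3)       # a test point in feature space
primal = theta @ phi_x           # f_theta(x) = theta^T phi(x)
dual = alpha @ (Phi @ phi_x)     # f_alpha(x) = sum_i alpha_i phi(x_i)^T phi(x)
assert np.allclose(primal, dual)
```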

DUAL PERCEPTRON

Dualized Perceptron
Perceptron classification: f_θ(x_i) = θ^T φ(x_i); in dual form, f_α(x_i) = Σ_{j=1}^n α_j φ(x_j)^T φ(x_i).
Perceptron: the algorithm halts when, for all samples, y_i f_θ(x_i) > 0, i.e., y_i Σ_{j=1}^n α_j φ(x_j)^T φ(x_i) > 0: every sample lies on the correct side of the hyperplane.
Update step: θ = θ + y_i φ(x_i). In dual form:
Σ_{j=1}^n α_j^new φ(x_j) = Σ_{j=1}^n α_j^old φ(x_j) + y_i φ(x_i)
⇒ α_i^new φ(x_i) = α_i^old φ(x_i) + y_i φ(x_i)
⇒ α_i^new = α_i^old + y_i.

Dualized Perceptron Algorithm
Perceptron(Instances (x_i, y_i))
  Set α = 0
  DO
    FOR i = 1, …, n
      IF y_i f_α(x_i) ≤ 0
      THEN α_i = α_i + y_i
    END
  WHILE α changes
  RETURN α
Decision function: f_α(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x)
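
A runnable sketch of the pseudocode above, written against a precomputed kernel/Gram matrix K with K[i, j] = φ(x_i)^T φ(x_j); the function name and the cap on the number of passes are assumptions.

```python
import numpy as np

def dual_perceptron(K, y, max_passes=100):
    """Dualized perceptron: K is the (n, n) Gram matrix, y the labels in {-1, +1}."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_passes):
        changed = False
        for i in range(n):
            # f_alpha(x_i) = sum_j alpha_j phi(x_j)^T phi(x_i)
            if y[i] * (alpha @ K[:, i]) <= 0:
                alpha[i] += y[i]        # dual update: alpha_i = alpha_i + y_i
                changed = True
        if not changed:                 # all samples on the correct side: stop
            break
    return alpha
```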

Dualized Perceptron
Perceptron loss, no regularizer.
Dual form of the decision function: f_α(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x)
Dual form of the update rule: if y_i f_α(x_i) ≤ 0, then α_i = α_i + y_i.
Equivalent to the primal form of the perceptron. Advantageous to use instead of the primal perceptron if there are few samples and φ(x) is high dimensional.

DUAL SUPPORT VECTOR MACHINE

Dualized Support Vector Machine
Primal: min_θ Σ_{i=1}^n max(0, 1 − y_i φ(x_i)^T θ) + (1/2λ) θ^T θ
Equivalent optimization problem with side constraints:
min_{θ,ξ} λ Σ_{i=1}^n ξ_i + (1/2) θ^T θ  such that  y_i φ(x_i)^T θ ≥ 1 − ξ_i and ξ_i ≥ 0.
Goal: a dual formulation of the optimization problem.

Dualized Support Vector Machine
Optimization problem with side constraints:
min_{θ,ξ} λ Σ_{i=1}^n ξ_i + (1/2) θ^T θ  such that  y_i φ(x_i)^T θ ≥ 1 − ξ_i and ξ_i ≥ 0.
Lagrange function with Lagrange multipliers β ≥ 0 and β⁰ ≥ 0 for the side constraints:
L(θ, ξ, β, β⁰) = λ Σ_{i=1}^n ξ_i + (1/2) θ^T θ − Σ_{i=1}^n β_i (y_i φ(x_i)^T θ − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i
General scheme: goal function Z(θ, ξ), side constraints g(θ, ξ) ≥ 0, Lagrange function Z(θ, ξ) − β g(θ, ξ).
Optimization problem without side constraints: min_{θ,ξ} max_{β,β⁰ ≥ 0} L(θ, ξ, β, β⁰).

Dualized Support Vector Machine
Lagrange function:
L(θ, ξ, β, β⁰) = λ Σ_{i=1}^n ξ_i + (1/2) θ^T θ − Σ_{i=1}^n β_i (y_i φ(x_i)^T θ − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i
Since it is convex in θ, ξ, the strong duality theorem gives
min_{θ,ξ} max_{β,β⁰ ≥ 0} L(θ, ξ, β, β⁰) = max_{β,β⁰ ≥ 0} min_{θ,ξ} L(θ, ξ, β, β⁰).
Minimum: set the derivative of L w.r.t. θ and ξ to zero:
∂L/∂θ = 0  ⇒  θ = Σ_{i=1}^n β_i y_i φ(x_i) = Σ_{i=1}^n α_i φ(x_i)  (the relation between primal and dual parameters: the Representer Theorem)
∂L/∂ξ_i = 0  ⇒  λ = β_i + β_i⁰.

Dualized Support Vector Machine
Substitute the derived parameters θ = Σ_{i=1}^n β_i y_i φ(x_i) and λ = β_i + β_i⁰ into the Lagrange function:
L(θ, ξ, β, β⁰) = (1/2) θ^T θ − Σ_{i=1}^n β_i (y_i φ(x_i)^T θ − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i + λ Σ_{i=1}^n ξ_i
= (1/2) (Σ_{i=1}^n β_i y_i φ(x_i))^T (Σ_{j=1}^n β_j y_j φ(x_j)) − Σ_{i=1}^n β_i (y_i φ(x_i)^T Σ_{j=1}^n β_j y_j φ(x_j) − 1 + ξ_i) − Σ_{i=1}^n β_i⁰ ξ_i + λ Σ_{i=1}^n ξ_i
= (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j) − Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j) + Σ_{i=1}^n β_i − Σ_{i=1}^n (β_i + β_i⁰) ξ_i + λ Σ_{i=1}^n ξ_i
= Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j),
where the ξ_i terms cancel because β_i + β_i⁰ = λ.

Dualized Support Vector Machine
Substituting the derived parameters into the Lagrange function gives
L(β) = Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j).
Since β⁰ ≥ 0 and λ = β_i + β_i⁰, it follows that 0 ≤ β_i ≤ λ.
Optimization criterion of the dual SVM:
max_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j)  such that  0 ≤ β_i ≤ λ.
The term Σ_i β_i acts as an L1 regularizer of β (sparse solutions); the quadratic term is large if β_i, β_j > 0 for similar samples of different classes.

Dualized Support Vector Machine
Optimization criterion of the dual SVM (with λ = β_i + β_i⁰):
max_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j)  such that  0 ≤ β_i ≤ λ.
The multiplier β_i corresponds to the constraint y_i φ(x_i)^T θ ≥ 1 − ξ_i, and β_i⁰ to ξ_i ≥ 0. A Lagrange multiplier is greater than 0 exactly when its corresponding constraint is fulfilled with equality:
β_i = 0, β_i⁰ = λ: we have y_i φ(x_i)^T θ > 1 − ξ_i and ξ_i = 0 (the distance to the hyperplane exceeds the margin).
β_i = λ, β_i⁰ = 0: we have y_i φ(x_i)^T θ = 1 − ξ_i and ξ_i > 0 (the sample violates the margin).
0 < β_i < λ: we have y_i φ(x_i)^T θ = 1 − ξ_i and ξ_i = 0 (the sample lies on the margin).

Dualized Support Vector Machine
[Figures: illustrations of the three cases above, a sample whose distance to the hyperplane exceeds the margin (β_i = 0), a sample that violates the margin (β_i = λ), and a sample lying exactly on the margin (0 < β_i < λ).]

Dualized Support Vector Machine
Optimization criterion of the dual SVM:
max_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j)  such that  0 ≤ β_i ≤ λ.
Optimization over n parameters β. Solution found with a QP solver in O(n²). Sparse solution. Samples only appear as pairwise inner products.
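
The dual is a box-constrained concave QP, so besides a generic QP solver it can also be approached with simple projected gradient ascent; the following sketch is illustrative only (step size, iteration count, and names are assumptions), not the solver referred to in the lecture.

```python
import numpy as np

def dual_svm_pga(K, y, lam=1.0, lr=0.01, iters=2000):
    """Maximize sum_i beta_i - 1/2 sum_ij beta_i beta_j y_i y_j K_ij  s.t.  0 <= beta_i <= lam."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K                 # Q_ij = y_i y_j k(x_i, x_j)
    beta = np.zeros(n)
    for _ in range(iters):
        grad = 1.0 - Q @ beta                         # gradient of the concave dual objective
        beta = np.clip(beta + lr * grad, 0.0, lam)    # ascent step + projection onto the box
    return beta
```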

Dualized Support Vector Machine
The primal and dual optimization problems have the same solution:
θ = Σ_{x_i ∈ SV} β_i y_i φ(x_i), where the support vectors SV are the samples with β_i > 0.
Dual form of the decision function: f_β(x) = Σ_{x_i ∈ SV} β_i y_i φ(x_i)^T φ(x)
Primal SVM: the solution is a vector θ in the space of the attributes. Dual SVM: the same solution is represented as weights β_i on the samples.

Dualized Support Vector Machine
Hinge loss, L2 regularization.
Dual form of the decision function: f_β(x) = Σ_{x_i ∈ SV} β_i y_i φ(x_i)^T φ(x)
Dual form of the optimization problem:
max_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j φ(x_i)^T φ(x_j)  such that  0 ≤ β_i ≤ λ.
The primal and dual optimization problems have identical solutions but different forms. The dual is advantageous if there are few samples and φ(x) is high dimensional.

Kernel Support Vector Machine
Optimization criterion of the kernel SVM:
max_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j k(x_i, x_j)  such that  0 ≤ β_i ≤ λ,
where k is the inner product function.
Decision function: f_β(x) = Σ_{x_i ∈ SV} β_i y_i k(x_i, x)
Samples only interact through the kernel function k(x_i, x_j). The feature mapping φ no longer appears in the optimization problem or the decision function.
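
In practice, a kernel SVM with a precomputed Gram matrix can be trained with an off-the-shelf solver; a sketch using scikit-learn's SVC, whose parameter C plays the role of the box constraint on β (the toy data and the value of γ are assumptions for the example):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)     # nonlinearly separable labels

gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dists)                               # RBF kernel matrix k(x_i, x_j)

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)            # samples enter only through K
print("support vector indices:", clf.support_)              # the samples with beta_i > 0
```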

KERNELS

Kernels and Kernel Methods
The feature mapping φ(x) can be high dimensional; the number of estimated parameters θ depends on φ, and computing φ(x) is expensive.
Previously: given φ(x), the inner product φ(x)^T φ(x') measures the similarity between samples. Many methods can be formulated so that samples only appear as pairwise inner products.
Idea: replace the inner product with any similarity measure k(x, x') = φ(x)^T φ(x') and map samples only implicitly.
For which functions k does there exist a mapping φ(x) such that k represents an inner product?

Kernel Functions: Motivation
Can we simply choose k(x, x') to be any function? We need k to be an inner product in some feature space; otherwise we lose meaning and convexity.
Optimization criterion of the kernel SVM:
max_{0 ≤ β ≤ λ} Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j k(x_i, x_j)
= max_{0 ≤ β ≤ λ} 1^T β − (1/2) (y ∘ β)^T K (y ∘ β),  with K_ij = k(x_i, x_j).
This optimization is convex (with a unique solution) if K is positive semi-definite (non-negative eigenvalues).

Recap: Positive Definiteness
A matrix K is called positive semi-definite (PSD) if x^T K x ≥ 0 holds for all x. It is called positive definite if equality holds only at x = 0.
A function k is called positive semi-definite (PSD) if ∫∫ z(x) k(x, x') z(x') dx dx' ≥ 0 holds for all continuous functions z.
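
A numerical check of the matrix condition (illustrative; the tolerance is an assumption): a Gram matrix is PSD exactly when all its eigenvalues are non-negative, which is what makes the kernel SVM objective convex.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """K is assumed symmetric; eigvalsh returns the (real) eigenvalues of a symmetric matrix."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```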

Recap: Positive Definiteness
A matrix K is called positive semi-definite if x^T K x ≥ 0 for all x. Example: a covariance matrix Σ of a Gaussian density
N(x; μ, Σ) = (1 / √((2π)^m |Σ|)) · exp(−(1/2) (x − μ)^T Σ^{-1} (x − μ)).
Positive definite matrices are invertible, and the inverse is also positive definite.
Positive definiteness implies a norm, ‖x‖ = √(x^T Σ^{-1} x), and the Mahalanobis distance d(x, x') = √((x − x')^T Σ^{-1} (x − x')).
[Figure: contour plot of a 2-D Gaussian density.]

Kernels
Theorem: For every positive definite function k there exists a mapping φ(x) such that k(x, x') = φ(x)^T φ(x') for all x and x'.
This mapping is not unique: for example, φ₁(x) = x and φ₂(x) = −x give φ₁(x)^T φ₁(x') = x^T x' = (−x)^T (−x') = φ₂(x)^T φ₂(x').
Gram matrix (kernel matrix) K, with K_ij = k(x_i, x_j): the matrix of inner products, i.e., pairwise similarities between samples; an n × n matrix. k(x, x') is PSD iff K is a PSD matrix for every dataset.
Constructive proofs:
Reproducing Kernel Hilbert Space (RKHS). Idea: define the mapping as the function φ(x) = k(x, ·), define an inner product ⟨·, ·⟩ between functions, and show k(x, x') = ⟨k(x, ·), k(x', ·)⟩.
Mercer mapping. Idea: decompose k in terms of its eigenfunctions; practically relevant is the finite case.

MERCER MAP

Mercer Map
Eigenvalue decomposition: every symmetric matrix K can be decomposed in terms of its eigenvectors u_i and eigenvalues λ_i:
K = U Λ U^{-1}, with Λ = diag(λ_1, …, λ_n) and U = (u_1 … u_n).
If K is positive semi-definite, then λ_i ∈ R_{≥0}. The eigenvectors are orthonormal (u_i^T u_i = 1 and u_i^T u_j = 0 for i ≠ j), and U is orthogonal: U^T = U^{-1}.

Mercer Map
Thus it holds (eigenvalue decomposition):
K = U Λ U^T = U Λ^{1/2} Λ^{1/2} U^T = (U Λ^{1/2}) (U Λ^{1/2})^T,
where Λ^{1/2} is the diagonal matrix with entries √λ_i.
A feature mapping for the training data can then be defined as
(φ(x_1) … φ(x_n)) = (U Λ^{1/2})^T.

Mercer Map
A feature mapping for the training data can be defined as (φ(x_1) … φ(x_n)) = (U Λ^{1/2})^T.
Kernel matrix between training and test data:
K_test = Φ(X_train)^T Φ(X_test) = U Λ^{1/2} Φ(X_test)
Solving this equation yields a mapping of the test data:
Φ(X_test) = (U Λ^{1/2})^{-1} K_test = Λ^{-1/2} U^T K_test  (since U^T = U^{-1}).
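
A numpy sketch of the Mercer map (function and variable names are assumptions): recover explicit training features from the Gram matrix and map test points given only their kernel values against the training set.

```python
import numpy as np

def mercer_map(K_train, K_test):
    """K_train: (n, n) Gram matrix; K_test: (n, m) with entries k(x_i, x_test_j)."""
    lam, U = np.linalg.eigh(K_train)            # K = U diag(lam) U^T, lam >= 0 if K is PSD
    lam = np.clip(lam, 0.0, None)               # guard against tiny negative eigenvalues
    Phi_train = (U * np.sqrt(lam)).T            # columns are phi(x_i) = (U Lambda^{1/2})^T
    sqrt_lam = np.sqrt(lam)
    inv_sqrt = np.zeros_like(sqrt_lam)
    nonzero = sqrt_lam > 1e-10
    inv_sqrt[nonzero] = 1.0 / sqrt_lam[nonzero]
    Phi_test = inv_sqrt[:, None] * (U.T @ K_test)   # Lambda^{-1/2} U^T K_test
    return Phi_train, Phi_test
```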

Mercer Map
Useful if a learning problem is given as a kernel function but learning should take place in the primal; for example, if the kernel matrix would be too large (quadratic memory consumption!).

Kernel Functions
Polynomial kernels: k_poly(x_i, x_j) = (x_i^T x_j + 1)^p
Radial basis functions: k_RBF(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)
Sigmoid kernels, string kernels (e.g., for classification of gene sequences), graph kernels for learning with structured instances.
Further literature: B. Schölkopf, A. J. Smola: Learning with Kernels, 2002.

Polynomial Kernels
Kernel function: k_poly(x_i, x_j) = (x_i^T x_j + 1)^p
Which transformation φ corresponds to this kernel? Example: 2-D input space, p = 2.

Polynomial Kernels
Kernel: k_poly(x_i, x_j) = (x_i^T x_j + 1)^p, 2-D input, p = 2:
k_poly(x_i, x_j) = (x_i^T x_j + 1)² = (x_i1 x_j1 + x_i2 x_j2 + 1)²
= x_i1² x_j1² + x_i2² x_j2² + 2 x_i1 x_j1 x_i2 x_j2 + 2 x_i1 x_j1 + 2 x_i2 x_j2 + 1
= (x_i1², x_i2², √2 x_i1 x_i2, √2 x_i1, √2 x_i2, 1) · (x_j1², x_j2², √2 x_j1 x_j2, √2 x_j1, √2 x_j2, 1)^T
= φ(x_i)^T φ(x_j),
where φ contains all monomials of degree up to 2 over the input attributes.
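
A quick check of the derivation above (illustrative values): for 2-D inputs and p = 2, the polynomial kernel equals an explicit inner product of six monomial features.

```python
import numpy as np

def phi_poly2(x):
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

xi = np.array([0.3, -1.2])
xj = np.array([2.0, 0.5])
k_direct = (xi @ xj + 1.0) ** 2             # k_poly(x_i, x_j) with p = 2
k_explicit = phi_poly2(xi) @ phi_poly2(xj)  # phi(x_i)^T phi(x_j)
assert np.isclose(k_direct, k_explicit)
```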

RBF Kernel
Kernel: k_RBF(x_i, x_j) = exp(−γ ‖x_i − x_j‖²)
Which transformation φ corresponds to this kernel?

Kernels
The kernel function k(x, x') = φ(x)^T φ(x') computes the inner product of the feature mapping of two instances. The kernel function can often be computed without an explicit representation φ(x), e.g., the polynomial kernel k_poly(x_i, x_j) = (x_i^T x_j + 1)^p.
Infinite-dimensional feature mappings are possible, e.g., the RBF kernel k_RBF(x_i, x_j) = exp(−γ ‖x_i − x_j‖²).
For every positive definite kernel there is a feature mapping φ(x) such that k(x, x') = φ(x)^T φ(x'). For a given kernel matrix, the Mercer map provides a feature mapping.

Summary
Representer Theorem: f_θ(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x); samples only interact through inner products.
Kernel Perceptron:
Perceptron(Instances (x_i, y_i))
  Set α = 0
  DO
    FOR i = 1, …, n
      IF y_i f_α(x_i) ≤ 0
      THEN α_i = α_i + y_i
    END
  WHILE α changes
  RETURN α
with linear model f_α(x) = Σ_{i=1}^n α_i k(x_i, x).
Kernel SVM:
max_β Σ_{i=1}^n β_i − (1/2) Σ_{i,j=1}^n β_i β_j y_i y_j k(x_i, x_j)  such that  0 ≤ β_i ≤ λ,
with decision function f_β(x) = Σ_{x_i ∈ SV} β_i y_i k(x_i, x).
Kernel functions: positive definite functions k(x, x') are an inner product for some feature space; feature mappings are done implicitly.

Frohe Weihachte & Eie Gute Rutsch Maschielles Lere Next Time: Kerels for structured data & learig for structured outputs 58