Lecture 3: Dual problems and Kernels


C4B Machine Learning, Hilary 2011, A. Zisserman

- Primal and dual forms
- Linear separability revisited
- Feature mapping
- Kernels for SVMs
  - kernel trick
  - requirements
  - radial basis functions

SVM review

We have seen that for an SVM, learning a linear classifier f(x) = w^T x + b is formulated as solving an optimization problem over w:

$$\min_{w \in \mathbb{R}^d} \; \|w\|^2 + C \sum_{i}^{N} \max\big(0,\, 1 - y_i f(x_i)\big)$$

This quadratic optimization problem is known as the primal problem. Instead, the SVM can be formulated to learn a linear classifier

$$f(x) = \sum_{i} \alpha_i y_i \,(x_i^\top x) + b$$

by solving an optimization problem over α. This is known as the dual problem, and we will look at the advantages of this formulation.
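As a minimal sketch (illustrative, not the lecture's own code), the primal objective above can be written directly in NumPy; the function name `primal_objective` is an assumption.

```python
# Sketch of the primal SVM objective: ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)),
# with f(x) = w^T x + b.
import numpy as np

def primal_objective(w, b, X, y, C):
    """X: (N, d) data, y: (N,) labels in {-1, +1}."""
    margins = y * (X @ w + b)                  # y_i f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)     # per-point hinge loss
    return w @ w + C * hinge.sum()
```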

Sketch derivation of dual form

The Representer Theorem states that the solution w can always be written as a linear combination of the training data:

$$w = \sum_{j=1}^{N} \alpha_j y_j x_j$$

Proof: see example sheet.

Now, substitute for w in w^T x + b:

$$\sum_{j=1}^{N} \alpha_j y_j x_j^\top x + b = \sum_{j=1}^{N} \alpha_j y_j \,(x_j^\top x) + b$$

and for w in the cost function $\min_w \|w\|^2$ subject to $y_i\big(w^\top x_i + b\big) \ge 1$, for all i:

$$\|w\|^2 = \Big(\sum_{j} \alpha_j y_j x_j\Big)^\top \Big(\sum_{k} \alpha_k y_k x_k\Big) = \sum_{jk} \alpha_j \alpha_k y_j y_k \,(x_j^\top x_k)$$

Hence, an equivalent optimization problem is over the α_j:

$$\min_\alpha \sum_{jk} \alpha_j \alpha_k y_j y_k \,(x_j^\top x_k) \quad \text{subject to} \quad y_i \Big(\sum_{j=1}^{N} \alpha_j y_j \,(x_j^\top x_i) + b\Big) \ge 1, \;\; \forall i$$

and a few more steps are required to complete the derivation.

Primal and dual formulations

N is the number of training points, and d is the dimension of the feature vector x.

Primal problem: for $w \in \mathbb{R}^d$

$$\min_{w \in \mathbb{R}^d} \; \|w\|^2 + C \sum_{i}^{N} \max\big(0,\, 1 - y_i f(x_i)\big)$$

Dual problem: for $\alpha \in \mathbb{R}^N$ (stated without proof):

$$\max_{\alpha \ge 0} \; \sum_{i} \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \,(x_j^\top x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \;\; \forall i, \quad \sum_{i} \alpha_i y_i = 0$$

- Complexity of the solution is O(d^3) for the primal, and O(N^3) for the dual
- If N << d then it is more efficient to solve for α than w
- The dual form only involves (x_j^T x_i). We will return to why this is an advantage when we look at kernels.
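The dual is a quadratic program in α. The following is a minimal sketch (not part of the lecture) of solving it numerically with the cvxopt QP solver; the function name `svm_dual_fit` and the 1e-6 tolerance are illustrative choices.

```python
# Sketch: solve the soft-margin SVM dual as a QP.
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_fit(X, y, C=1.0):
    """X: (N, d) data, y: (N,) labels in {-1, +1}.  Returns (alpha, w, b)."""
    N = X.shape[0]
    y = y.astype(float)
    K = X @ X.T                                      # inner products x_j^T x_k
    P = matrix(np.outer(y, y) * K)                   # quadratic term
    q = matrix(-np.ones(N))                          # linear term: -sum_i alpha_i
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))   # encode 0 <= alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1))                     # equality: sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    # Representer Theorem: w = sum_j alpha_j y_j x_j; b from an on-margin vector
    w = (alpha * y) @ X
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)
    b0 = np.mean(y[on_margin] - X[on_margin] @ w) if on_margin.any() else 0.0
    return alpha, w, b0
```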

Primal and dual formulations

Primal version of the classifier: $f(x) = w^\top x + b$

Dual version of the classifier: $f(x) = \sum_i \alpha_i y_i \,(x_i^\top x) + b$

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points x_i. However, many of the α_i are zero. The ones that are non-zero define the support vectors x_i.

[Figure: support vector machine geometry, showing the decision boundary $w^\top x + b = 0$, the weight vector w, and the support vectors; the dual classifier $\sum_i \alpha_i y_i \,(x_i^\top x) + b$ depends only on the support vectors.]
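To make the sparsity point concrete, here is a minimal sketch (illustrative names, not from the lecture) of evaluating the dual classifier using only the support vectors.

```python
import numpy as np

def dual_decision_function(x, X_train, y_train, alpha, b, tol=1e-6):
    """f(x) = sum_i alpha_i y_i (x_i^T x) + b, summing only over support vectors."""
    sv = alpha > tol                               # non-zero weights
    return (alpha[sv] * y_train[sv]) @ (X_train[sv] @ x) + b
```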

Handling data that is not linearly separable

Introduce slack variables $\xi_i \ge 0$:

$$\min_{w \in \mathbb{R}^d,\; \xi_i \in \mathbb{R}^+} \; \|w\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i \big( w^\top x_i + b \big) \ge 1 - \xi_i \;\; \text{for } i = 1 \dots N$$

What if a linear classifier is not appropriate for the data at all?

Solution 1: use polar coordinates

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} r \\ \theta \end{pmatrix} \qquad \mathbb{R}^2 \to \mathbb{R}^2$$

[Figure: ring-shaped classes in (x_1, x_2) mapped to (r, θ).] The data is linearly separable in polar coordinates; the classifier acts non-linearly in the original space. A sketch of this mapping is given below.
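A minimal sketch (illustrative, not from the lecture) of "Solution 1": mapping to polar coordinates so that ring-shaped classes become separable by a threshold on r alone.

```python
import numpy as np

def polar_feature_map(X):
    """Map (N, 2) points (x1, x2) to (r, theta)."""
    r = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.stack([r, theta], axis=1)

# Example: an inner disc (class -1) and an outer ring (class +1) are not linearly
# separable in (x1, x2), but are separated by a threshold on r after mapping.
```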

Solution 2: map data to a higher dimension

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} \qquad \mathbb{R}^2 \to \mathbb{R}^3$$

[Figure: the mapped data plotted against the axes $X = x_1^2$, $Y = x_2^2$, $Z = \sqrt{2}\, x_1 x_2$.] The data is linearly separable in 3D. This means that the problem can still be solved by a linear classifier.

SVM classifiers in a transformed feature space

$$\Phi : x \mapsto \Phi(x), \qquad \mathbb{R}^d \to \mathbb{R}^D$$

Learn a classifier linear in w for $\mathbb{R}^D$:  $f(x) = w^\top \Phi(x) + b$
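A minimal sketch (illustrative names, not from the lecture) of the explicit map above. A circular boundary $x_1^2 + x_2^2 = \rho$ in the original space becomes the plane $X + Y = \rho$ in the new coordinates, so a linear classifier in R^3 can separate classes that a line in R^2 cannot.

```python
import numpy as np

def phi(X):
    """Map (N, 2) points to (N, 3) features (x1^2, x2^2, sqrt(2) x1 x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2], axis=1)

# Points on the unit circle all satisfy X + Y = 1 after mapping:
t = np.linspace(0.0, 2.0 * np.pi, 100)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
assert np.allclose(phi(circle)[:, 0] + phi(circle)[:, 1], 1.0)
```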

Primal classifier in a transformed feature space

Classifier, with $w \in \mathbb{R}^D$:  $f(x) = w^\top \Phi(x) + b$

Learning, for $w \in \mathbb{R}^D$:

$$\min_{w \in \mathbb{R}^D} \; \|w\|^2 + C \sum_{i}^{N} \max\big(0,\, 1 - y_i f(x_i)\big)$$

- Simply map x to Φ(x) where the data is separable
- Solve for w in the high-dimensional space R^D
- Complexity of the solution is now O(D^3) rather than O(d^3)

Dual classifier in a transformed feature space

Classifier:  $\sum_i \alpha_i y_i \,(x_i^\top x) + b$  becomes  $\sum_i \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b$

Learning:  $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k \,(x_j^\top x_k)$  becomes  $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, \Phi(x_j)^\top \Phi(x_k)$

subject to $0 \le \alpha_i \le C$ for all $i$, and $\sum_i \alpha_i y_i = 0$

Dual classifier in a transformed feature space

- Note that Φ(x) only occurs in pairs Φ(x_j)^T Φ(x_i)
- Once the scalar products are computed, the complexity is again O(N^3); it is not necessary to learn in the D-dimensional space, as it is for the primal
- Write $k(x_j, x_i) = \Phi(x_j)^\top \Phi(x_i)$. This is known as a kernel

Classifier:  $f(x) = \sum_i \alpha_i y_i\, k(x_i, x) + b$

Learning:  $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, k(x_j, x_k)$

subject to $0 \le \alpha_i \le C$ for all $i$, and $\sum_i \alpha_i y_i = 0$

Special transformations

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} \qquad \mathbb{R}^2 \to \mathbb{R}^3$$

Kernel trick:

$$\Phi(x)^\top \Phi(z) = \big(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\big) \begin{pmatrix} z_1^2 \\ z_2^2 \\ \sqrt{2}\, z_1 z_2 \end{pmatrix} = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x^\top z)^2$$

- The classifier can be learnt and applied without explicitly computing Φ(x)
- All that is required is the kernel $k(x, z) = (x^\top z)^2$
- Complexity is still O(N^3)
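A minimal numerical check of the kernel trick above (illustrative, not from the lecture): the kernel value $k(x, z) = (x^\top z)^2$ equals the inner product of the explicit features Φ(x) and Φ(z).

```python
import numpy as np

def phi(x):                      # explicit feature map R^2 -> R^3
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
assert np.isclose((x @ z) ** 2, phi(x) @ phi(z))   # identical, up to rounding
```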

Example kernels

- Linear kernels: $k(x, x') = x^\top x'$
- Polynomial kernels: $k(x, x') = \big(1 + x^\top x'\big)^d$ for any d > 0. Contains all polynomial terms up to degree d.
- Gaussian kernels: $k(x, x') = \exp\big(-\|x - x'\|^2 / 2\sigma^2\big)$ for σ > 0. Infinite dimensional feature space.

Valid kernels: when can the kernel trick be used?

Given some arbitrary function k(x_i, x_j), how do we know if it corresponds to a scalar product Φ(x_i)^T Φ(x_j) in some space?

Mercer kernels: if k(·,·) satisfies:

- Symmetric: $k(x_i, x_j) = k(x_j, x_i)$
- Positive definite: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^N$, where K is the N × N Gram matrix with entries $K_{ij} = k(x_i, x_j)$

then k(·,·) is a valid kernel. e.g. $k(x, z) = x^\top z$ is a valid kernel; $k(x, z) = -\,x^\top z$ is not.
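The Mercer conditions can be probed empirically on a data sample, as in the minimal sketch below (illustrative names; passing the check on a finite sample is necessary but of course not a proof of validity).

```python
import numpy as np

def gram_matrix(kernel, X):
    """N x N Gram matrix K_ij = k(x_i, x_j)."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def looks_like_a_valid_kernel(kernel, X, tol=1e-9):
    K = gram_matrix(kernel, X)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)   # alpha^T K alpha >= 0
    return symmetric and psd

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
print(looks_like_a_valid_kernel(lambda x, z: x @ z, X))      # True: linear kernel
print(looks_like_a_valid_kernel(lambda x, z: -(x @ z), X))   # False: not PSD
```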

SVM classifier with Gaussian kernel

$$f(x) = \sum_{i}^{N} \alpha_i y_i\, k(x_i, x) + b$$

where N is the size of the training data, α_i is a weight (which may be zero), and the x_i with non-zero α_i are the support vectors.

Gaussian kernel: $k(x, x') = \exp\big(-\|x - x'\|^2 / 2\sigma^2\big)$

Radial Basis Function (RBF) SVM:

$$f(x) = \sum_{i} \alpha_i y_i \exp\big(-\|x_i - x\|^2 / 2\sigma^2\big) + b$$

RBF Kernel SVM Example

[Figure: scatter plot of the two classes in the (feature x, feature y) plane; the data is not linearly separable in the original feature space.]
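A minimal sketch (not from the lecture) of fitting an RBF-kernel SVM with scikit-learn on ring-shaped data. Note that sklearn parameterises the kernel as $\exp(-\gamma \|x - x'\|^2)$, so gamma corresponds to $1/(2\sigma^2)$ in the lecture's notation; the particular σ and C values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)
sigma, C = 1.0, 10.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```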

[Figures: RBF kernel SVM decision boundaries $f(x) = \sum_i \alpha_i y_i \exp\big(-\|x_i - x\|^2/2\sigma^2\big) + b$ for a range of settings. With σ = 1.0 fixed, decreasing C gives a wider (soft) margin. With C fixed, decreasing σ (e.g. σ = 0.25, then σ = 0.1) moves the classifier towards a nearest neighbour classifier.]

Kernel block structure

N × N Gram matrix with entries $K_{ij} = k(x_i, x_j)$.

[Figures: decision boundaries and Gram matrices for a linear kernel (C = 0.1) and an RBF kernel (C = 1, gamma = 0.25) on the same data, with positive/negative vectors, support vectors, margin vectors, decision boundary and margins marked. The Gram matrix of the RBF kernel shows a clear block structure: the kernel measures similarity between the points.]

Kernel Trick - Summary

- Classifiers can be learnt for high dimensional feature spaces, without actually having to map the points into the high dimensional space
- Data may be linearly separable in the high dimensional space, but not linearly separable in the original feature space
- Kernels can be used for an SVM because of the scalar product in the dual form, but can also be used elsewhere; they are not tied to the SVM formalism
- Kernels apply also to objects that are not vectors, e.g. $k(h, h') = \sum_k \min(h_k, h'_k)$ for histograms with bins $h_k, h'_k$ (see the sketch after this list)
- We will see other examples of kernels later in regression and unsupervised learning
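A minimal sketch (illustrative, not from the lecture) of the histogram intersection kernel mentioned above, used e.g. to compare bag-of-features histograms.

```python
import numpy as np

def histogram_intersection_kernel(H1, H2):
    """H1: (n, K) and H2: (m, K) histograms; returns the (n, m) Gram matrix
    with entries sum_k min(h_k, h'_k)."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)

# Such a precomputed Gram matrix can be supplied to an SVM that accepts
# user-defined kernels, e.g. sklearn's SVC(kernel="precomputed").
```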

Background reading

- Bishop, chapters 6.2 and 7
- Hastie et al., chapter 12
- More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml