Sparse Kernel Machines - SVM

Sparse Kernel Machines - SVM Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Support Vector Machines 1 / 42

Outline 1 Introduction 2 Maximum Margin Classifiers 3 Multi-Class SVMs 4 The regression case 5 Small Example 6 Summary Henrik I. Christensen (RIM@GT) Support Vector Machines 2 / 42

Introduction Last time we talked about kernels and memory-based models. Estimating the full Gram matrix can pose a major challenge, so it is desirable to store only the relevant data. Two possible solutions discussed: 1 Support Vector Machines (Vapnik, et al.) 2 Relevance Vector Machines. The main difference is in how posterior probabilities are handled. A small robotics example at the end shows SVM performance. Henrik I. Christensen (RIM@GT) Support Vector Machines 3 / 42

Outline 1 Introduction 2 Maximum Margin Classifiers 3 Multi-Class SVMs 4 The regression case 5 Small Example 6 Summary Henrik I. Christensen (RIM@GT) Support Vector Machines 4 / 42

Maximum Margin Classifiers - Preliminaries
Let's initially consider a linear two-class problem
$y(x) = w^T \phi(x) + b$
with $\phi(\cdot)$ being a feature space transformation and $b$ the bias.
Given a training dataset $x_i$, $i \in \{1 \dots N\}$, with target values $t_i \in \{-1, 1\}$.
Assume for now that there is a linear solution to the problem.
Henrik I. Christensen (RIM@GT) Support Vector Machines 5 / 42
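
A minimal Python/NumPy sketch of this decision function; the feature map, weights, bias, and example input are placeholder assumptions, not values from the lecture:

```python
import numpy as np

def linear_svm_decision(x, w, b, phi=lambda v: v):
    """Evaluate y(x) = w^T phi(x) + b; the sign gives the predicted class in {-1, +1}."""
    y = w @ phi(x) + b
    return y, int(np.sign(y))

# Hypothetical example: identity feature map, 2-D input
w = np.array([1.0, -2.0])
b = 0.5
print(linear_svm_decision(np.array([0.3, 0.1]), w, b))  # (0.6, 1)
```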

The objective The objective here is to maximize the margin. Let's just keep the points at the margin. [Figure: decision boundary $y = 0$ with margin boundaries $y = -1$ and $y = 1$; the margin is the gap between them] Henrik I. Christensen (RIM@GT) Support Vector Machines 6 / 42

Recap distances and metrics [Figure: geometry of a linear decision boundary $y(x) = 0$ separating regions $R_1$ ($y > 0$) and $R_2$ ($y < 0$); $w$ is orthogonal to the boundary, the signed distance from a point $x$ to the boundary is $y(x)/\|w\|$, and the boundary's offset from the origin is $-w_0/\|w\|$] Henrik I. Christensen (RIM@GT) Support Vector Machines 7 / 42

The objective function
We know that $y(x_n)$ and $t_n$ are supposed to have the same sign, so that $t_n y(x_n) > 0$, and the distance of a point to the decision boundary is
$\frac{t_n y(x_n)}{\|w\|} = \frac{t_n (w^T \phi(x_n) + b)}{\|w\|}$
The solution is then
$\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n (w^T \phi(x_n) + b) \right] \right\}$
We can scale $w$ and $b$ without loss of generality. Scale the parameters so that for the points closest to the boundary
$t_n (w^T \phi(x_n) + b) = 1$
Then for all data points it is true that
$t_n (w^T \phi(x_n) + b) \geq 1$
Henrik I. Christensen (RIM@GT) Support Vector Machines 8 / 42
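
A small NumPy sketch of these quantities, computing the functional margins $t_n(w^T x_n + b)$ and the resulting geometric margin; the toy data, $w$, and $b$ are made-up assumptions for illustration:

```python
import numpy as np

def margins(X, t, w, b):
    """Functional margins t_n * (w^T x_n + b) and the geometric margin min_n(...) / ||w||."""
    functional = t * (X @ w + b)          # one value per training point
    geometric = functional.min() / np.linalg.norm(w)
    return functional, geometric

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margins(X, t, w, b))
```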

Parameter estimation
We need to maximize $\|w\|^{-1}$, which can be seen as minimizing $\|w\|^2$ subject to the margin requirements. In Lagrange terms this is then
$L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (w^T \phi(x_n) + b) - 1 \right\}$
Setting the partial derivatives to zero gives us
$w = \sum_{n=1}^{N} a_n t_n \phi(x_n)$ and $0 = \sum_{n=1}^{N} a_n t_n$
Henrik I. Christensen (RIM@GT) Support Vector Machines 9 / 42

Parameter estimation
Eliminating $w$ and $b$ from the objective function we have the dual
$L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)$
This is a quadratic optimization problem - more on solving it in a minute.
We can evaluate new points using the form
$y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b$
Henrik I. Christensen (RIM@GT) Support Vector Machines 10 / 42
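
A NumPy sketch of the dual objective for a given kernel; the RBF kernel choice and the candidate multipliers a are placeholder assumptions used only to show the shape of the computation:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2), evaluated for all pairs of rows."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def dual_objective(a, t, K):
    """L(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)."""
    at = a * t
    return a.sum() - 0.5 * at @ K @ at

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
t = np.array([-1.0, 1.0, 1.0])
K = rbf_kernel(X, X)
a = np.array([0.5, 0.3, 0.2])          # candidate Lagrange multipliers
print(dual_objective(a, t, K))
```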

Estimation of the bias
Once the multipliers (and thus $w$) have been estimated we can use them for estimation of the bias
$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right)$
where $S$ is the set of support vectors and $N_S$ its size.
Henrik I. Christensen (RIM@GT) Support Vector Machines 11 / 42
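
Continuing the sketch above, the bias and the resulting decision function can be computed from a dual solution roughly like this (a hedged illustration; a real QP solver would supply the multipliers a):

```python
import numpy as np

def svm_bias(a, t, K, tol=1e-8):
    """b = (1/N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_n, x_m) )."""
    S = a > tol                                   # support vectors have a_n > 0
    K_SS = K[np.ix_(S, S)]
    return np.mean(t[S] - K_SS @ (a[S] * t[S]))

def svm_predict(kernel_row, a, t, b, tol=1e-8):
    """y(x) = sum_n a_n t_n k(x, x_n) + b, given the precomputed row k(x, x_n)."""
    S = a > tol
    return kernel_row[S] @ (a[S] * t[S]) + b
```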

Illustrative Synthetic Example Henrik I. Christensen (RIM@GT) Support Vector Machines 12 / 42

Status We have formulated the objective function Still not clear how we will solve it! We have assumed the classes are separable How about more messy data? Henrik I. Christensen (RIM@GT) Support Vector Machines 13 / 42

Overlapping class distributions
Assume some data cannot be correctly classified. Let's define a margin distance. Consider
$\xi_n = 1 - t_n y(x_n)$
1 $\xi < 0$ - correct classification, beyond the margin
2 $\xi = 0$ - exactly at the margin
3 $0 < \xi < 1$ - between the decision boundary and the margin (still correctly classified)
4 $1 < \xi < 2$ - between the decision boundary and the margin on the other side (misclassified)
5 $\xi > 2$ - the point is definitely misclassified
Henrik I. Christensen (RIM@GT) Support Vector Machines 14 / 42
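
A tiny NumPy sketch of this quantity, sorting each point into the regimes listed above; the example outputs and labels are assumptions for illustration:

```python
import numpy as np

def slack(y, t):
    """xi_n = 1 - t_n * y(x_n); values <= 0 mean on or beyond the correct margin."""
    return 1.0 - t * y

y = np.array([1.7, 0.4, -0.3, -1.5])   # hypothetical outputs y(x_n)
t = np.array([1, 1, 1, 1])             # all labelled +1
for v in slack(y, t):                  # [-0.7, 0.6, 1.3, 2.5]
    if v <= 0:
        print(v, "correct, outside the margin")
    elif v < 1:
        print(v, "inside the margin, still correct")
    elif v <= 2:
        print(v, "misclassified, inside the other margin")
    else:
        print(v, "definitely misclassified")
```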

Overlap in margin [Figure: soft-margin case; points with $\xi = 0$ lie on or beyond their margin boundary ($y = \pm 1$), points with $\xi < 1$ lie inside the margin but on the correct side of $y = 0$, and points with $\xi > 1$ lie on the wrong side of the decision boundary] Henrik I. Christensen (RIM@GT) Support Vector Machines 15 / 42

Recasting the problem
Optimizing not just for $w$ but also for misclassification, so we minimize
$C \sum_{n=1}^{N} \xi_n + \frac{1}{2}\|w\|^2$
where $C$ is a regularization coefficient. We have a new objective function
$L(w, b, a) = \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \{ t_n y(x_n) - 1 + \xi_n \} - \sum_{n=1}^{N} \mu_n \xi_n$
where the $a_n$ and $\mu_n$ are Lagrange multipliers.
Henrik I. Christensen (RIM@GT) Support Vector Machines 16 / 42

Optimization
As before we can take partial derivatives and find the extrema. The resulting objective function is then
$L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)$
which is like before, but the constraints are a little different:
$0 \leq a_n \leq C$ and $\sum_{n=1}^{N} a_n t_n = 0$
across all training samples. Many training samples will have $a_n = 0$, which is the same as saying they are not at the margin.
Henrik I. Christensen (RIM@GT) Support Vector Machines 17 / 42
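
In practice this box-constrained dual is handled by a library solver; a minimal, hedged example using scikit-learn's SVC (the toy data and hyperparameter values are arbitrary assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

# C controls the trade-off between margin size and the slack penalty
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, t)
print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, t))
```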

Generating a solution Solutions are generated through analysis of all training data. Re-organization enables some optimization (Vapnik, 1982). Sequential minimal optimization is a common approach (Platt, 2000): it considers pairwise interactions between Lagrange multipliers, and its complexity is somewhere between linear and quadratic. Henrik I. Christensen (RIM@GT) Support Vector Machines 18 / 42
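
A hedged sketch of the core SMO idea: update one pair of multipliers analytically while keeping the constraint sum_n a_n t_n = 0. This is the textbook two-variable step only, not Platt's full algorithm with its pair-selection heuristics:

```python
import numpy as np

def smo_pair_update(i, j, a, t, K, b, C):
    """Jointly optimize (a_i, a_j) for the soft-margin dual, holding all other a fixed."""
    f = (a * t) @ K + b                    # current outputs y(x_n) for all training points
    E_i, E_j = f[i] - t[i], f[j] - t[j]    # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:
        return a                           # skip degenerate pair
    # Unconstrained optimum along the constraint line, then clip to the box [L, H]
    a_j_new = a[j] + t[j] * (E_i - E_j) / eta
    if t[i] != t[j]:
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    a_j_new = np.clip(a_j_new, L, H)
    a = a.copy()
    a[i] += t[i] * t[j] * (a[j] - a_j_new)  # preserves sum_n a_n t_n = 0
    a[j] = a_j_new
    return a
```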

Mixed example [Figure: soft-margin SVM on overlapping synthetic data; decision boundary and margins drawn over points spanning roughly $-2$ to $2$ on both axes] Henrik I. Christensen (RIM@GT) Support Vector Machines 19 / 42

Outline 1 Introduction 2 Maximum Margin Classifiers 3 Multi-Class SVMs 4 The regression case 5 Small Example 6 Summary Henrik I. Christensen (RIM@GT) Support Vector Machines 20 / 42

Multi-Class SVMs So far the discussion has been for the two-class problem. How to extend to K classes? 1 One versus the rest 2 Hierarchical Trees - One vs One 3 Coding the classes to generate a new problem Henrik I. Christensen (RIM@GT) Support Vector Machines 21 / 42

One versus the rest Train one classifier per class, with all the other classes serving as the negative training samples. Typically training is skewed - too few positives compared to negatives - giving a better fit for the negatives. The one-vs-rest approach implies extra training complexity: K separate classifiers must be trained. Henrik I. Christensen (RIM@GT) Support Vector Machines 22 / 42
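
A hedged sketch of one-vs-rest multi-class SVM classification with scikit-learn; the dataset and hyperparameters are arbitrary placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)       # 3-class toy problem

# Trains K binary SVMs, each separating one class from all the rest
ovr = OneVsRestClassifier(SVC(C=1.0, kernel="rbf", gamma="scale")).fit(X, y)
print(ovr.predict(X[:5]))
print("number of binary classifiers:", len(ovr.estimators_))
```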

Tree classifier Organize the problem as a tree of selections: best-first elimination - handle the easy cases first - based on pairwise comparisons of classes. Still requires on the order of $K^2$ pairwise class comparisons. Henrik I. Christensen (RIM@GT) Support Vector Machines 23 / 42

Coding new classes Consider optimizing an error code over the classes: how to choose the criterion function so as to minimize errors. Can be considered a generalization of the voting-based strategy. Poses a larger training challenge. Henrik I. Christensen (RIM@GT) Support Vector Machines 24 / 42

Outline 1 Introduction 2 Maximum Margin Classifiers 3 Multi-Class SVMs 4 The regression case 5 Small Example 6 Summary Henrik I. Christensen (RIM@GT) Support Vector Machines 25 / 42

The regression case
In regression the target is not separation of classes but minimization of the regression error, i.e.
$\sum_{n=1}^{N} \{y_n - t_n\}^2 + \frac{\lambda}{2}\|w\|^2$
The problem is similar to the case for mixed classes. We can define an error function similar to the $\xi$ term used for classification; an example could be:
$E_\epsilon(y(x) - t) = \begin{cases} 0, & |y(x) - t| < \epsilon \\ |y(x) - t| - \epsilon, & \text{otherwise} \end{cases}$
Henrik I. Christensen (RIM@GT) Support Vector Machines 26 / 42
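
A one-function NumPy sketch of this $\epsilon$-insensitive error; the $\epsilon$ value and sample residuals are illustrative assumptions:

```python
import numpy as np

def eps_insensitive(y, t, eps=0.1):
    """E_eps(y - t): zero inside the eps-tube, linear outside it."""
    return np.maximum(0.0, np.abs(y - t) - eps)

targets = np.array([0.0, 0.05, 0.2, -0.5])
predictions = np.zeros_like(targets)
print(eps_insensitive(predictions, targets))   # [0.  0.  0.1 0.4]
```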

Example: the $\epsilon$-insensitive error function [Figure: $E(z)$ is zero for $|z| < \epsilon$ and increases linearly outside the interval $[-\epsilon, \epsilon]$] Henrik I. Christensen (RIM@GT) Support Vector Machines 27 / 42

The regularized error function
The optimization is then with respect to the error function
$C \sum_{n=1}^{N} E_\epsilon(y(x_n) - t_n) + \frac{1}{2}\|w\|^2$
Just as before we can define a Lagrangian to be optimized and set the criteria for the optimization. The criteria are largely the same (see the book for details, Eqns. 7.56-7.60).
Henrik I. Christensen (RIM@GT) Support Vector Machines 28 / 42
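
A hedged scikit-learn sketch of support vector regression with this C / $\epsilon$ parameterization; the sinusoidal toy data and hyperparameters are assumptions, not from the lecture:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = np.linspace(0, 1, 60).reshape(-1, 1)
t = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=60)

# C weights the eps-insensitive error term; epsilon sets the width of the tube
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=5.0).fit(X, t)
print("support vectors:", reg.support_vectors_.shape[0])
print("prediction at x=0.25:", reg.predict([[0.25]])[0])
```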

Regression illustration [Figure: SVM regression fit to noisy 1-D data with the $\epsilon$-insensitive tube around the predicted curve; vertical axis $t$ from roughly $-1$ to $1$, horizontal axis $x$ from $0$ to $1$] Henrik I. Christensen (RIM@GT) Support Vector Machines 29 / 42

Outline 1 Introduction 2 Maximum Margin Classifiers 3 Multi-Class SVMs 4 The regression case 5 Small Example 6 Summary Henrik I. Christensen (RIM@GT) Support Vector Machines 30 / 42

Categorization of Rooms Example of using SVM for room categorization Recognition of different types of rooms across extended periods Training data recorded over a period of 6 months Training and evaluation across 3 different settings Extensive evaluation Henrik I. Christensen (RIM@GT) Support Vector Machines 31 / 42

Room Categories Henrik I. Christensen (RIM@GT) Support Vector Machines 32 / 42

Training Organization Henrik I. Christensen (RIM@GT) Support Vector Machines 33 / 42

Training Organization Henrik I. Christensen (RIM@GT) Support Vector Machines 34 / 42

Preprocessing of data Henrik I. Christensen (RIM@GT) Support Vector Machines 35 / 42

SVM details
The system uses a $\chi^2$ kernel, which is widely used for histogram comparison. The kernel is defined as
$K(x, y) = e^{-\gamma \chi^2(x, y)}$ with $\chi^2(x, y) = \sum_i (x_i - y_i)^2 / (x_i + y_i)$
Initially introduced by Marszalek, et al., IJCV 2007. Trained using one vs the rest.
Henrik I. Christensen (RIM@GT) Support Vector Machines 36 / 42
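
A hedged NumPy sketch of this $\chi^2$ kernel used as a precomputed kernel for an SVM; the tiny histogram data, labels, and $\gamma$ value are placeholder assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(X, Y, gamma=1.0, eps=1e-10):
    """K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)) for histogram features."""
    diff2 = (X[:, None, :] - Y[None, :, :]) ** 2
    denom = X[:, None, :] + Y[None, :, :] + eps      # eps guards against empty bins
    return np.exp(-gamma * (diff2 / denom).sum(axis=-1))

# Hypothetical normalized histograms for two "room" classes
X = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.2, 0.2, 0.6]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="precomputed").fit(chi2_kernel(X, X), y)
print(clf.predict(chi2_kernel(np.array([[0.65, 0.25, 0.1]]), X)))   # expect class 0
```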

SVM results - Video Henrik I. Christensen (RIM@GT) Support Vector Machines 37 / 42

The recognition results Henrik I. Christensen (RIM@GT) Support Vector Machines 38 / 42

Another small example How to remove dependency on background? (Roobaert, 1999) Henrik I. Christensen (RIM@GT) Support Vector Machines 39 / 42

Smart use of SVMs - a hack with applications Henrik I. Christensen (RIM@GT) Support Vector Machines 40 / 42

Outline 1 Introduction 2 Maximum Margin Classifiers 3 Multi-Class SVMs 4 The regression case 5 Small Example 6 Summary Henrik I. Christensen (RIM@GT) Support Vector Machines 41 / 42

Summary An approach that stores only the key data for recognition/regression, with an optimization defined to classify data points. The learning is fairly involved (complex): basically a quadratic optimization problem, evaluated across all training data, keeping only the essential data. 1 Training can be costly 2 Execution can be fast - optimized. Multi-class cases can pose a bit of a challenge. Henrik I. Christensen (RIM@GT) Support Vector Machines 42 / 42