Sparse Kernel Machines - SVM

1 Sparse Kernel Machines - SVM
Henrik I. Christensen
Robotics & Intelligent Machines @ GT
Georgia Institute of Technology, Atlanta, GA
hic@cc.gatech.edu
Henrik I. Christensen (RIM@GT) Support Vector Machines 1 / 42

2 Outline
1. Introduction
2. Maximum Margin Classifiers
3. Multi-Class SVMs
4. The regression case
5. Small Example
6. Summary

3 Introduction
Last time we talked about Kernels and Memory Based Models
Estimating the full Gram matrix can pose a major challenge
Desirable to store only the relevant data
Two possible solutions discussed
1. Support Vector Machines (Vapnik, et al.)
2. Relevance Vector Machines
Main difference is in how posterior probabilities are handled
Small robotics example at end to show SVM performance

5 Maximum Margin Classifiers - Preliminaries
Let's initially consider a linear two-class problem
\[ y(x) = w^T \phi(x) + b \]
with $\phi(\cdot)$ being a feature space transformation and $b$ the bias factor
Given a training dataset $x_i$, $i \in \{1, \ldots, N\}$
Target values $t_i$, $i \in \{1, \ldots, N\}$, $t_i \in \{-1, 1\}$
Assume for now that there is a linear solution to the problem

6 The objective
The objective here is to optimize the margin
Let's just keep the points at the margin
[Figure: decision boundary y = 0 between the margin boundaries y = -1 and y = 1, with the margin marked]

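7 Recap distances and metrics
[Figure: geometry of a linear discriminant in the (x_1, x_2) plane: the decision surface y = 0 separates region R_1 (y > 0) from region R_2 (y < 0), w is normal to the surface, the signed distance of a point x from the surface is y(x)/||w||, and the offset of the surface from the origin is -w_0/||w||]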

8 The objective function
We know that y(x) and t are supposed to have the same sign so that y(x) t > 0, i.e. the distance of a point to the decision surface is
\[ \frac{t_n y(x_n)}{\|w\|} = \frac{t_n (w^T \phi(x_n) + b)}{\|w\|} \]
The solution is then
\[ \arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n (w^T \phi(x_n) + b) \right] \right\} \]
We can scale w and b without loss of generality. Scale the parameters so that the key vector points satisfy
\[ t_n (w^T \phi(x_n) + b) = 1 \]
Then for all data points it is true that
\[ t_n (w^T \phi(x_n) + b) \geq 1 \]
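The margin and the rescaling step can be checked numerically. The sketch below is only an illustration under stated assumptions: the feature map is taken to be the identity, the data are separable, and the names `margin` and `canonical` as well as the toy data are made up for this example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def margin(w, b, X, t):
    """Smallest value of t_n * (w.x_n + b) / ||w|| over the data set."""
    norm = math.sqrt(dot(w, w))
    return min(tn * (dot(w, x) + b) / norm for x, tn in zip(X, t))

def canonical(w, b, X, t):
    """Rescale (w, b) so that the closest point satisfies t_n * y(x_n) = 1.

    Assumes every point is correctly classified, so the scale is positive.
    """
    scale = min(tn * (dot(w, x) + b) for x, tn in zip(X, t))
    return [wi / scale for wi in w], b / scale

# Toy data (hypothetical): two points per class, separable along the first axis.
X = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 1.0)]
t = [1, 1, -1, -1]
w, b = [4.0, 0.0], 0.0

w_c, b_c = canonical(w, b, X, t)
print(margin(w, b, X, t), margin(w_c, b_c, X, t), 1.0 / math.sqrt(dot(w_c, w_c)))
```

Rescaling leaves the margin unchanged, and in the canonical form the margin equals 1/||w||, which is why maximizing the margin reduces to minimizing ||w||.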

9 Parameter estimation
We need to optimize $\|w\|^{-1}$, which can be seen as minimizing $\|w\|^2$ subject to the margin requirements
In Lagrange terms this is then
\[ L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n (w^T \phi(x_n) + b) - 1 \right\} \]
Analyzing partial derivatives gives us
\[ w = \sum_{n=1}^{N} a_n t_n \phi(x_n) \]
\[ 0 = \sum_{n=1}^{N} a_n t_n \]

10 Parameter estimation
Eliminating w and b from the objective function we have
\[ L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m) \]
This is a quadratic optimization problem - see in a minute
We can evaluate new points using the form
\[ y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b \]
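To make the dual objective concrete, here is a small sketch that evaluates L(a) for a two-point problem. The linear kernel, the toy data and the function names are assumptions chosen for illustration. With one point per class the constraint sum_n a_n t_n = 0 forces a_1 = a_2, so the objective can be scanned along a single value.

```python
def linear_kernel(x, z):
    """k(x, z) = x . z, a stand-in for any valid kernel."""
    return sum(xi * zi for xi, zi in zip(x, z))

def dual_objective(a, t, K):
    """L(a) = sum_n a_n - 1/2 sum_n sum_m a_n a_m t_n t_m K[n][m]."""
    n = len(a)
    quad = sum(a[i] * a[j] * t[i] * t[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(a) - 0.5 * quad

# Hypothetical toy problem: one point per class on the real line.
X = [(1.0,), (-1.0,)]
t = [1, -1]
K = [[linear_kernel(x, z) for z in X] for x in X]

# Scan a_1 = a_2 = v; for this data L reduces to 2v - 2v^2.
for v in (0.4, 0.5, 0.6):
    print(v, dual_objective([v, v], t, K))
```

The scan peaks at a_1 = a_2 = 0.5, which is the maximizer of this small quadratic problem.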

11 Estimation of the bias
Once w has been estimated we can use that for estimation of the bias
\[ b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right) \]
where S is the set of support vectors and $N_S$ is their number
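A minimal sketch of the bias estimate and of evaluating a new point, assuming the multipliers have already been found. The linear kernel, the toy data, the tolerance used to detect support vectors and the function names are all assumptions for illustration; the multipliers shown are the hard-margin solution for this particular data set.

```python
def linear_kernel(x, z):
    """k(x, z) = x . z, a stand-in for any valid kernel."""
    return sum(xi * zi for xi, zi in zip(x, z))

def bias(a, t, X, kernel, tol=1e-9):
    """Average of t_n - sum_{m in S} a_m t_m k(x_n, x_m) over support vectors."""
    S = [n for n in range(len(a)) if a[n] > tol]  # indices with a_n > 0
    total = 0.0
    for n in S:
        total += t[n] - sum(a[m] * t[m] * kernel(X[n], X[m]) for m in S)
    return total / len(S)

def predict(x, a, t, X, b, kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b; the sign gives the class."""
    return sum(a[n] * t[n] * kernel(x, X[n]) for n in range(len(a))) + b

# Hypothetical data on the real line; the third point lies beyond the margin.
X = [(1.0,), (-1.0,), (3.0,)]
t = [1, -1, 1]
a = [0.5, 0.5, 0.0]

b = bias(a, t, X, linear_kernel)
print(b, predict((2.0,), a, t, X, b, linear_kernel))
```

Only the two points with non-zero multipliers enter either sum, which is the sparsity the method is after.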

12 Illustrative Synthetic Example

13 Status
We have formulated the objective function
Still not clear how we will solve it!
We have assumed the classes are separable
How about messier data?

14 Overlapping class distributions
Assume some data cannot be correctly classified
Let's define a margin distance
Consider $\xi_n = 1 - t_n y(x_n)$
1. $\xi < 0$ - correct classification, outside the margin
2. $\xi = 0$ - on the margin
3. $\xi \in [0; 1]$ - between the margin and the decision boundary
4. $\xi \in [1; 2]$ - between the decision boundary and the other class's margin
5. $\xi > 2$ - the point is definitely misclassified
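The cases above can be sketched in code. This is an illustration under an assumption: the margin distance is taken as xi_n = 1 - t_n y(x_n), and the slack that enters the optimization is its positive part, max(0, xi_n). The function names and the region labels are invented for the example.

```python
def margin_distance(t_n, y_n):
    """Signed margin distance xi = 1 - t * y(x) (assumed definition)."""
    return 1.0 - t_n * y_n

def slack(t_n, y_n):
    """Slack variable used in the optimization: zero outside the margin."""
    return max(0.0, margin_distance(t_n, y_n))

def region(xi):
    """Map a margin distance to the five cases listed on the slide."""
    if xi < 0:
        return "correct, outside the margin"
    if xi == 0:
        return "on the margin"
    if xi <= 1:
        return "between margin and decision boundary"
    if xi <= 2:
        return "between decision boundary and the other margin"
    return "definitely misclassified"

# A point of class t = +1 evaluated at several values of y(x).
for y in (2.0, 1.0, 0.5, -0.5, -2.0):
    xi = margin_distance(1, y)
    print(y, xi, slack(1, y), region(xi))
```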

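15 Overlap in margin
[Figure: decision boundary y = 0 and margin boundaries y = -1 and y = 1; points on or outside the margin have ξ = 0, points inside the margin on the correct side have ξ < 1, and misclassified points have ξ > 1]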

16 Recasting the problem
Optimizing not just for w but also for misclassification
So we have
\[ C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2 \]
where C is a regularization coefficient.
We have a new objective function
\[ L(w, b, a) = \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} a_n \{ t_n y(x_n) - 1 + \xi_n \} - \sum_{n=1}^{N} \mu_n \xi_n \]
where a and μ are Lagrange multipliers

17 Optimization
As before we can take partial derivatives and find the extrema.
The resulting objective function is then
\[ L(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m) \]
which is like before, but the constraints are a little different
\[ 0 \leq a_n \leq C \quad \text{and} \quad \sum_{n=1}^{N} a_n t_n = 0 \]
which is across all training samples
Many training samples will have $a_n = 0$, which is the same as saying they lie outside the margin and are not support vectors.

18 Generating a solution
Solutions are generated through analysis of all training data
Re-organization enables some optimization (Vapnik, 1982)
Sequential minimal optimization is a common approach (Platt, 2000)
Considers pairwise interaction between Lagrange multipliers
Complexity is somewhere between linear and quadratic
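The pairwise interaction at the heart of sequential minimal optimization can be sketched as a single analytic update of two multipliers. This is not the full algorithm: the pair-selection heuristics and the bias update are left out, and the function names and toy problem are assumptions for illustration. The step keeps sum_n a_n t_n unchanged and clips both multipliers to [0, C].

```python
def decision(i, a, t, K, b):
    """y(x_i) = sum_n a_n t_n K[n][i] + b for training point i."""
    return sum(a[n] * t[n] * K[n][i] for n in range(len(a))) + b

def smo_pair_step(i, j, a, t, K, b, C):
    """Jointly optimise a[i] and a[j]; returns a new list of multipliers."""
    E_i = decision(i, a, t, K, b) - t[i]  # prediction errors
    E_j = decision(j, a, t, K, b) - t[j]
    # Feasible interval for a[j] implied by the box and equality constraints.
    if t[i] != t[j]:
        lo, hi = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:
        lo, hi = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    eta = K[i][i] + K[j][j] - 2.0 * K[i][j]  # curvature along the pair
    if eta <= 0.0 or lo >= hi:
        return list(a)  # no progress possible for this pair
    a_j = min(hi, max(lo, a[j] + t[j] * (E_i - E_j) / eta))
    a_i = a[i] + t[i] * t[j] * (a[j] - a_j)  # preserves sum_n a_n t_n
    new = list(a)
    new[i], new[j] = a_i, a_j
    return new

# Hypothetical two-point problem with a linear kernel: x = +1 and x = -1.
K = [[1.0, -1.0], [-1.0, 1.0]]
t = [1, -1]
print(smo_pair_step(0, 1, [0.0, 0.0], t, K, b=0.0, C=10.0))
print(smo_pair_step(0, 1, [0.0, 0.0], t, K, b=0.0, C=0.25))
```

With a large C the single step lands on the unconstrained optimum of this toy problem; with a small C the update is clipped at the box constraint.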

19 Mixed example

21 Multi-Class SVMs
Thus far the discussion has been for the two-class problem
How to extend to K classes?
1. One versus the rest
2. Hierarchical Trees - One vs One
3. Coding the classes to generate a new problem

22 One versus the rest
Training for each class with all the others serving as the non-class training samples
Typically training is skewed - too few positives compared to negatives
Better fit for the negatives
The one vs all implies extra complexity in training, $K^2$
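A minimal sketch of the one-versus-the-rest scheme: relabel the targets once per class, train one two-class machine per relabelling, and pick the class whose machine gives the largest output. The scoring functions below are hypothetical stand-ins for trained SVMs, and the class names are placeholders.

```python
def one_vs_rest_targets(labels, positive):
    """Targets for the machine of class `positive`: +1 for it, -1 for the rest."""
    return [1 if c == positive else -1 for c in labels]

def one_vs_rest_predict(x, scorers):
    """Pick the class whose decision function y_k(x) is largest."""
    return max(scorers, key=lambda c: scorers[c](x))

labels = ["A", "B", "C", "B"]
print(one_vs_rest_targets(labels, "B"))  # note the skew towards negatives

# Hypothetical decision functions on a scalar input, one per class.
scorers = {
    "A": lambda x: -x,
    "B": lambda x: 1.0 - abs(x - 1.0),
    "C": lambda x: x - 2.0,
}
print([one_vs_rest_predict(x, scorers) for x in (-1.0, 1.0, 4.0)])
```

Comparing raw outputs across machines assumes their scales are comparable, which is not guaranteed when each machine is trained on a differently skewed problem.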

23 Tree classifier
Organize the problem as a tree selection
Best first elimination - select easy cases first
Based on pairwise comparison of classes
Still requires extra comparison of $K^2$ classes
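One way to picture the pairwise approach is an elimination scheme: each comparison between two classes removes the loser until one class is left. The sketch below is an assumption about how such a tree could be organized, not the method used in the lecture. The `duel` function stands in for a trained two-class SVM, and the class centres are invented.

```python
from itertools import combinations

def class_pairs(classes):
    """All pairs that need a two-class machine: K(K-1)/2 of them."""
    return list(combinations(classes, 2))

def eliminate(x, classes, duel):
    """Compare the first and last remaining class, drop the loser, repeat.

    Uses K-1 comparisons at test time; duel(x, c1, c2) returns the winner.
    """
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        if duel(x, first, last) == first:
            remaining.pop()      # last class eliminated
        else:
            remaining.pop(0)     # first class eliminated
    return remaining[0]

# Hypothetical pairwise rule: the class with the nearer centre wins.
centres = {"A": 0.0, "B": 5.0, "C": 10.0, "D": 15.0}

def duel(x, c1, c2):
    return c1 if abs(x - centres[c1]) <= abs(x - centres[c2]) else c2

print(len(class_pairs(centres)), eliminate(6.0, list(centres), duel))
```

Training still needs a machine for every pair of classes, so the training cost grows on the order of K^2, while classification of a new point needs only K-1 comparisons.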

24 Coding new classes
Considers optimization of an error-correcting code over the classes
How to minimize the criterion function to minimize errors
Considered a generalization of the voting-based strategy
Poses a larger training challenge
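A small sketch of decoding with an error-correcting class code: each class is assigned a codeword of +1/-1 bits, one two-class machine is trained per bit, and a new point is assigned to the class whose codeword is closest in Hamming distance to the predicted bits. The codebook below is hypothetical and was chosen so that any two codewords differ in at least three positions, which lets a single wrong bit be corrected.

```python
def hamming(u, v):
    """Number of positions in which two codewords differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def decode(bits, codebook):
    """Class whose codeword is nearest to the predicted bit vector."""
    return min(codebook, key=lambda c: hamming(bits, codebook[c]))

# Hypothetical 5-bit code for three classes (pairwise distance >= 3).
codebook = {
    "A": (1, 1, 1, 1, 1),
    "B": (-1, -1, -1, 1, 1),
    "C": (-1, 1, 1, -1, -1),
}

# Outputs of the five two-class machines, with one bit of "A" flipped.
print(decode((1, 1, -1, 1, 1), codebook))
```

The price is that more two-class machines must be trained than there are classes, which matches the larger training challenge noted above.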

26 The regression case
In regression the target is not separation of classes but minimization of regression error, i.e.
\[ \sum_{n=1}^{N} \{y_n - t_n\}^2 + \frac{\lambda}{2} \|w\|^2 \]
The problem is similar to the case for mixed classes
We can define an error function similar to the $\xi$ term used for classification, an example could be:
\[ E_\epsilon(y(x) - t) = \begin{cases} 0, & |y(x) - t| < \epsilon \\ |y(x) - t| - \epsilon, & \text{otherwise} \end{cases} \]
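The ε-insensitive error function is short enough to write out directly. The sketch below compares it with the squared error for a few residuals; the function names and the sample values are assumptions for illustration.

```python
def eps_insensitive(y, t, eps):
    """E_eps(y - t): zero inside the tube |y - t| < eps, linear outside."""
    return max(0.0, abs(y - t) - eps)

def squared_error(y, t):
    """Squared error, shown for comparison."""
    return (y - t) ** 2

eps = 0.1
for y in (1.05, 1.5, 0.5):
    print(y, eps_insensitive(y, 1.0, eps), squared_error(y, 1.0))
```

Residuals smaller than ε cost nothing, so training points inside the tube drop out of the solution in the same way that points outside the margin do in classification.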

27 Example ε error function
[Figure: the ε-insensitive error E(z) plotted against z; the error is zero for -ε ≤ z ≤ ε and grows linearly outside that interval]

28 The regularized error function
The optimization is then with respect to the error function
\[ C \sum_{n=1}^{N} E_\epsilon(y(x_n) - t_n) + \frac{1}{2} \|w\|^2 \]
Just as before we can define a Lagrangian to be optimized / set criteria for the optimization
The criteria are largely the same (see book for details, Eqns )

29 Regression illustration
[Figure: regression example, t plotted against x]

31 Categorization of Rooms
Example of using SVM for room categorization
Recognition of different types of rooms across extended periods
Training data recorded over a period of 6 months
Training and evaluation across 3 different settings
Extensive evaluation

32 Room Categories

33 Training Organization

34 Training Organization

35 Preprocessing of data

36 SVM details
The system uses a $\chi^2$ kernel.
The kernel is widely used for histogram comparison
The kernel is defined as
\[ K(x, y) = e^{-\gamma \chi^2(x, y)} \]
\[ \chi^2(x, y) = \sum_i \left\{ \frac{|x_i - y_i|^2}{|x_i + y_i|} \right\} \]
Initially introduced by Marszalek, et al, IJCV
Trained using one vs the rest
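The kernel definition translates almost directly into code. In the sketch below, skipping bins where both histograms are empty is an assumption made to avoid division by zero, and the function names, the value of γ and the sample histograms are illustrative only.

```python
import math

def chi2_distance(x, y):
    """chi^2(x, y) = sum_i |x_i - y_i|^2 / |x_i + y_i| over histogram bins."""
    total = 0.0
    for xi, yi in zip(x, y):
        denom = abs(xi + yi)
        if denom > 0.0:  # assumption: bins empty in both histograms add nothing
            total += abs(xi - yi) ** 2 / denom
    return total

def chi2_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * chi^2(x, y))."""
    return math.exp(-gamma * chi2_distance(x, y))

# Hypothetical normalized histograms with three bins.
h1 = [0.2, 0.3, 0.5]
h2 = [0.1, 0.4, 0.5]
print(chi2_kernel(h1, h1), chi2_kernel(h1, h2, gamma=0.5))
```

Identical histograms give a kernel value of 1, and the value decays towards 0 as the histograms diverge, with γ controlling how fast.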

37 SVM results - Video

38 The recognition results

39 Another small example
How to remove dependency on background? (Roobaert, 1999)

40 Smart use of SVMs - a hack with applications

42 Summary
An approach to storage of key data for recognition/regression
Definition of optimization to recognize data points
The learning is fairly involved (complex)
Basically a quadratic optimization problem
Evaluation across all training data
Keep the essential data
1. Training can be costly
2. Execution can be fast - optimized
Multi-class cases can pose a bit of a challenge
