TDT 4173 Machine Learning and Case-Based Reasoning
Lecture 6: Support Vector Machines. Ensemble Methods
Helge Langseth and Agnar Aamodt
NTNU IDI, Section for Intelligent Systems
Outline
1 Wrap-up from last time
2 Support Vector Machines: Background; Linear separators; The dual problem; Non-separable subspaces; Nonlinearity and kernels
3 Ensemble methods: Background; Bagging; Boosting
Support Vector Machines
Support Vector Machines (SVMs) and Kernel Methods
Based on the paper by Bennett and Campbell
Description of the task
Background
Data:
1 We have a set of data $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_m, y_m)\}$. The instances are described by $\mathbf{x}_i$; the class is $y_i$.
2 The data is generated by some unknown probability distribution $P(\mathbf{x}, y)$.
Task:
1 Be able to guess $y$ at a new location $\mathbf{x}$.
2 For SVMs one typically states this as: find an unknown function $f(\mathbf{x})$ that estimates $y$ at $\mathbf{x}$.
3 Note! In this lesson we look at binary classification, and let $y \in \{-1, +1\}$ denote the classes.
4 We will look for linear functions, i.e., $f(\mathbf{x}) = b + \mathbf{w}^T\mathbf{x} = b + \sum_{i} w_i x_i$.
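As a minimal sketch (not from the slides; the weights and data below are made up for illustration), such a linear classifier is a single line of NumPy:

```python
import numpy as np

def linear_classifier(x, w, b):
    """Predict +1 or -1 from the linear function f(x) = b + w^T x."""
    return np.sign(b + w @ x)

# Hypothetical parameters for a 2-D problem.
w = np.array([1.0, -2.0])
b = 0.5
print(linear_classifier(np.array([3.0, 1.0]), w, b))  # 0.5 + 3.0 - 2.0 = 1.5 -> 1.0
```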
Linear separators
How to find the best linear separator
We are looking for a linear separator for this data. There are many possible solutions, but only one is considered the best! SVMs are called large-margin classifiers, and the data points touching the margin lines are the support vectors.
[Figure sequence: the same data set shown with several candidate separators, then the maximum-margin separator with its support vectors highlighted.]
The geometry of the problem
Linear separators
[Figure: three parallel lines $\{\mathbf{x} : b + \mathbf{w}^T\mathbf{x} = -1\}$, $\{\mathbf{x} : b + \mathbf{w}^T\mathbf{x} = 0\}$ and $\{\mathbf{x} : b + \mathbf{w}^T\mathbf{x} = +1\}$, with normal vector $\mathbf{w}$.]
Note! Since one margin line has $b + \mathbf{w}^T\mathbf{x} = +1$ and the other has $b + \mathbf{w}^T\mathbf{x} = -1$, the distance between them is $2/\|\mathbf{w}\|$.
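A one-line derivation of that distance (standard, filling in the step the slide skips): take a point $\mathbf{x}_+$ on the $+1$ line and $\mathbf{x}_-$ on the $-1$ line, and project their difference onto the unit normal $\mathbf{w}/\|\mathbf{w}\|$:

```latex
\[
\mathbf{w}^T\mathbf{x}_+ - \mathbf{w}^T\mathbf{x}_- = (+1 - b) - (-1 - b) = 2
\quad\Longrightarrow\quad
\text{distance} = \frac{\mathbf{w}^T(\mathbf{x}_+ - \mathbf{x}_-)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}.
\]
```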
An optimisation problem
Linear separators
Optimisation criterion: The distance between the margins is $2/\|\mathbf{w}\|$, so that is what we want to maximise. Equivalently, we can minimise $\|\mathbf{w}\|/2$. For simplicity of the mathematics, we will rather minimise $\|\mathbf{w}\|^2/2$.
Constraints: The margin separates all data observations correctly:
$b + \mathbf{w}^T\mathbf{x}_i \leq -1$ for $y_i = -1$.
$b + \mathbf{w}^T\mathbf{x}_i \geq +1$ for $y_i = +1$.
Alternative (equivalent) constraint set: $y_i(b + \mathbf{w}^T\mathbf{x}_i) \geq 1$.
An optimisation problem (2)
Linear separators
Mathematical Programming Setting: Combining the above requirements, we obtain:
minimize w.r.t. $\mathbf{w}$ and $b$: $\frac{1}{2}\|\mathbf{w}\|^2$
subject to $y_i(b + \mathbf{w}^T\mathbf{x}_i) - 1 \geq 0, \; i = 1, \ldots, m$
Properties:
The problem is convex.
Hence it has a unique minimum.
Efficient algorithms for solving it exist.
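As a sketch of how directly this program can be written down (assuming the cvxpy library and a made-up, linearly separable toy dataset; not part of the original slides):

```python
import cvxpy as cp
import numpy as np

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (b + w^T x_i) >= 1
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()
print(w.value, b.value)
```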
The dual problem
The dual problem and the convex hull
The convex hull of $\{\mathbf{x}_j\}$ is the smallest subset of the instance space that
(i) is convex, and
(ii) contains all elements $\{\mathbf{x}_j\}$.
Find it by drawing lines between all $\mathbf{x}_j$ and choosing the outermost boundary.
The dual problem
The dual problem and the convex hull (2)
Look at the difference between the closest points of the two convex hulls, $\mathbf{c}$ and $\mathbf{d}$. The decision line must be orthogonal to the line between these two closest points, so we want to minimise $\|\mathbf{c} - \mathbf{d}\|$. $\mathbf{c}$ can be written as a convex combination of all elements in the first class: $\mathbf{c} = \sum_{i: y_i = \text{Class 1}} \alpha_i \mathbf{x}_i$ with $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$, and similarly for $\mathbf{d}$.
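A small sketch of this geometric view (assuming scipy; the two point clouds are made up): find the closest pair $\mathbf{c}$, $\mathbf{d}$ by optimising over the convex-combination weights directly:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5]])       # one class
B = np.array([[-1.0, -1.0], [-2.0, -1.0], [-1.5, 0.0]])  # the other class

def gap(p):
    c = p[:len(A)] @ A   # convex combination of class-1 points
    d = p[len(A):] @ B   # convex combination of class-2 points
    return np.sum((c - d) ** 2)

n = len(A) + len(B)
constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p[:len(A)]) - 1},  # weights of c sum to 1
    {"type": "eq", "fun": lambda p: np.sum(p[len(A):]) - 1},  # weights of d sum to 1
]
res = minimize(gap, np.full(n, 1 / 3), bounds=[(0, 1)] * n, constraints=constraints)
print(np.sqrt(res.fun))  # distance between the two convex hulls
```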
The dual problem
The dual problem and the convex hull (3)
Minimising $\|\mathbf{c} - \mathbf{d}\|$ is (modulo a constant) equivalent to this formulation:
minimize w.r.t. $\boldsymbol{\alpha}$: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j - \sum_{i=1}^{m} \alpha_i$
subject to $\sum_{i=1}^{m} y_i \alpha_i = 0$ and $\alpha_i \geq 0, \; i = 1, \ldots, m$
Properties:
The problem is convex, hence has a unique minimum.
It is a quadratic programming problem, with known solution methods.
At the solution, $\alpha_i > 0$ only if $\mathbf{x}_i$ is a support vector.
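A direct (if naive) sketch of solving this dual with a general-purpose optimiser (assuming scipy and the toy data from the primal sketch above; a real SVM solver would use a dedicated QP method):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

# H_ij = y_i y_j x_i^T x_j, the matrix in the quadratic term of the dual.
H = (y[:, None] * X) @ (y[:, None] * X).T

def dual(a):
    return 0.5 * a @ H @ a - a.sum()

res = minimize(dual, np.zeros(m),
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i y_i alpha_i = 0

alpha = res.x
w = (alpha * y) @ X  # recover w = sum_i alpha_i y_i x_i
print("support vectors:", np.where(alpha > 1e-6)[0], "w =", w)
```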
Theoretical foundation
The dual problem
1 Formal proofs of SVM properties are available (but out of scope for us).
2 Large margins are smart: if we have small variations in $\mathbf{x}$, we will still classify correctly.
3 There are many skinny margin planes, but only one if you look for the fattest plane; the fattest plane is thus more robust.
Non-separable subspaces
What if the convex hulls are overlapping?
If the convex hulls are overlapping, we cannot find a linear separator.
To handle this, we optimise a criterion where we maximise the distance between the lines minus a penalty for misclassifications.
This is equivalent to scaling the convex hulls, and doing as before on the reduced convex hulls.
Non-separable subspaces
What if the convex hulls are overlapping? (2)
The problem with scaling is (modulo a constant) equivalent to this formulation:
minimize w.r.t. $\boldsymbol{\alpha}$: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j - \sum_{i=1}^{m} \alpha_i$
subject to $\sum_{i=1}^{m} y_i \alpha_i = 0$ and $C \geq \alpha_i \geq 0, \; i = 1, \ldots, m$
Properties:
The problem is as before, but $C$ introduces the scaling; this is equivalent to incurring a cost for misclassification.
Still solvable using standard methods.
Demo: Different values of C: http://svm.dcs.rhbnc.ac.uk/pagesnew/gpat.shtml
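To see the role of C in practice, a short sketch (assuming scikit-learn; the overlapping data is randomly generated): a small C tolerates misclassifications and keeps the margin wide, while a large C penalises them heavily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.5, (50, 2)),   # two overlapping classes
               rng.normal(+1.0, 1.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```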
Support Vector Machines
Statistical Learning Theory
Misclassification error and the function complexity bound the generalization error.
Maximizing margins minimizes complexity.
Eliminates overfitting.
The solution depends only on the support vectors, not on the number of attributes.
Nonlinearity and kernels
Nonlinear problems: when scaling does not make sense
The problem is difficult to solve when $\mathbf{x} = (r, s)$ has only two dimensions... but if we blow it up to five dimensions: $\theta(\mathbf{x}) = \{r, s, rs, r^2, s^2\}$, i.e. invent the mapping $\theta(\cdot): \mathbb{R}^2 \to \mathbb{R}^5$, and try to find the linear separator in $\mathbb{R}^5$, then everything is OK.
Nonlinearity and kernels
Solving the problem in higher dimensions
We solve this as before, but remembering to look in the higher dimension:
minimize w.r.t. $\boldsymbol{\alpha}$: $\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \theta(\mathbf{x}_i)^T \theta(\mathbf{x}_j) - \sum_{i=1}^{m} \alpha_i$
subject to $\sum_{i=1}^{m} y_i \alpha_i = 0$ and $C \geq \alpha_i \geq 0, \; i = 1, \ldots, m$
Note that:
We do not need to evaluate $\theta(\mathbf{x})$ directly, only $\theta(\mathbf{x}_i)^T \theta(\mathbf{x}_j)$.
If we find a clever way of evaluating $\theta(\mathbf{x}_i)^T \theta(\mathbf{x}_j)$ (i.e., independent of the size of the target space), we can solve the problem easily, without even thinking about what $\theta(\mathbf{x})$ means.
We define $K(\mathbf{x}_i, \mathbf{x}_j) = \theta(\mathbf{x}_i)^T \theta(\mathbf{x}_j)$, and focus on finding $K(\cdot, \cdot)$ instead of the mapping. $K$ is called a kernel.
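The degree-2 polynomial kernel makes this concrete: a feature map close to the slide's $\{r, s, rs, r^2, s^2\}$ (with $\sqrt{2}$ scalings and a constant component added) has an inner product that can be computed entirely in the original two dimensions. A sketch verifying the identity numerically:

```python
import numpy as np

def theta(x):
    """Explicit map whose inner product equals (x^T z + 1)^2."""
    r, s = x
    return np.array([1.0, np.sqrt(2) * r, np.sqrt(2) * s,
                     r * r, s * s, np.sqrt(2) * r * s])

def poly_kernel(x, z):
    return (x @ z + 1.0) ** 2  # evaluated in R^2, never touching R^6

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(theta(x) @ theta(z))  # 4.0
print(poly_kernel(x, z))    # (1*3 + 2*(-1) + 1)^2 = 4.0
```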
Kernel functions
Nonlinearity and kernels
Degree-$d$ polynomial: $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j + 1)^d$
Radial Basis Function: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2)\right)$
Two-layer Neural Network: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{sigmoid}(\eta\, \mathbf{x}_i^T \mathbf{x}_j + c)$
Different kernels have different properties; finding the right kernel is a difficult task, and can be hard to visualise.
Example: The RBF kernel (implicitly) uses an infinitely dimensional representation for $\theta(\cdot)$.
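The three table entries, written out (a sketch; the parameter defaults are arbitrary, and tanh is the usual choice of sigmoid in the last kernel):

```python
import numpy as np

def polynomial_kernel(xi, xj, d=3):
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(xi, xj, eta=1.0, c=0.0):
    return np.tanh(eta * (xi @ xj) + c)
```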
SVMs: Algorithmic summary
Nonlinearity and kernels
Select the parameter $C$ (tradeoff between minimising training-set error and maximising the margin).
Select a kernel function and its associated parameters (e.g., $\sigma$ for RBF).
Solve the optimisation problem using quadratic programming.
Find the value $b$ by using the support vectors.
Classify a new point $\mathbf{x}$ using $f(\mathbf{x}) = \mathrm{sign}\left\{\sum_{i=1}^{m} y_i \alpha_i K(\mathbf{x}, \mathbf{x}_i) + b\right\}$
Demo: Different kernels: http://svm.dcs.rhbnc.ac.uk/pagesnew/gpat.shtml
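In practice a library carries out these steps; a sketch with scikit-learn (the toy problem is made up, and sklearn's `gamma` corresponds to $1/(2\sigma^2)$ in the RBF formula above):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up nonlinear problem: the class depends on the distance from the origin.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.5, 1, -1)

# Choose C and the kernel (+ parameters); fit() solves the QP and finds b.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)
print(clf.predict([[0.0, 0.0], [2.0, 2.0]]))  # expect roughly [ 1 -1]
```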
Support Vector Machines
SVM Extensions
Regression
Variable Selection
Boosting
Density Estimation
Unsupervised Learning
Novelty/Outlier Detection
Feature Detection
Clustering
Support Vector Machines
Many Other Applications
Speech Recognition
Database Marketing
Quark Flavors in High Energy Physics
Dynamic Object Recognition
Knock Detection in Engines
Protein Sequence Problem
Text Categorization
Breast Cancer Diagnosis
See: http://www.clopinet.com/isabelle/projects/SVM/applist.html
Support Vector Machines
Hallelujah!
Generalization theory and practice meet.
General methodology for many types of problems.
Same Program + New Kernel = New method.
No problems with local minima.
Few model parameters. Selects capacity.
Robust optimization methods.
Successful applications.
BUT...
Support Vector Machines
HYPE?
Will SVMs beat my best hand-tuned method Z for problem X?
Do SVMs scale to massive datasets?
How to choose C and the kernel?
What is the effect of attribute scaling?
How to handle categorical variables?
How to incorporate domain knowledge?
How to interpret results?