Support Vector Machines

1 Support Vector Machines. Stephan Dreiseitl, University of Applied Sciences Upper Austria at Hagenberg. Harvard-MIT Division of Health Sciences and Technology, HST.951J: Medical Decision Support

2 Overview: Motivation; Statistical learning theory; VC dimension; Optimal separating hyperplanes; Kernel functions; Performance evaluation

3 Motivation: Given data $D = \{(x_i, t_i)\}$ distributed according to $P(x, t)$, which model has better performance on the test set?

4 Motivation: Neural networks model $p(t \mid x)$, with complexity limited by topology restriction, early stopping, weight decay, or a Bayesian approach. SVM: capacity control, based on statistical learning theory.

5 Statistical learning theory: Given data $D = \{(x_i, t_i)\}$, model output $y(\alpha, x_i) \in \{+1, -1\}$, and class labels $t_i \in \{+1, -1\}$. Fundamental question: Is learning consistent? Can we infer performance on the test set (generalization error) from performance on the training set?

6 Statistical learning theory: Average error on a data set $D$ for a model with parameter $\alpha$: $R_{emp}(\alpha) = \frac{1}{2n} \sum_{i=1}^{n} |y(\alpha, x_i) - t_i|$. Expected error of the same model on unseen data distributed like $D$: $R(\alpha) = \frac{1}{2} \int |y(\alpha, x) - t| \, dP(x, t)$.
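
A minimal sketch of the empirical error for labels in {+1, -1}. The function name `empirical_risk` and the sample outputs and labels are hypothetical, chosen only for illustration.

```python
def empirical_risk(predictions, targets):
    """R_emp = 1/(2n) * sum_i |y_i - t_i| for outputs and labels in {+1, -1}."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(targets)
    # Each misclassified case contributes |y - t| = 2, so the factor 1/(2n)
    # turns the sum into the fraction of misclassified cases.
    return sum(abs(y - t) for y, t in zip(predictions, targets)) / (2 * n)


# Hypothetical model outputs y and class labels t (two of five cases wrong).
y = [+1, -1, +1, +1, -1]
t = [+1, -1, -1, +1, +1]
risk = empirical_risk(y, t)
```

With these inputs the empirical error is the misclassification rate on the five cases.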

7 Statistical learning theory: How can we relate $R_{emp}$ and $R$? The generalization error $R(\alpha)$ depends on the empirical error $R_{emp}(\alpha)$ and the capacity $h$ of the model. With probability $1 - \eta$: $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$, where $h$ is the VC (Vapnik-Chervonenkis) dimension.
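
The square-root term of the bound can be evaluated directly. The sketch below assumes natural logarithms; the function names and the default $\eta = 0.05$ are illustrative choices, not taken from the slides.

```python
import math


def vc_confidence(h, n, eta):
    """Square-root term of the bound for VC dimension h, sample size n, level eta."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def risk_bound(r_emp, h, n, eta=0.05):
    """Upper bound on R(alpha) that holds with probability 1 - eta."""
    return r_emp + vc_confidence(h, n, eta)


# Hypothetical settings: same sample size, increasing capacity.
low_capacity = vc_confidence(h=10, n=1000, eta=0.05)
high_capacity = vc_confidence(h=100, n=1000, eta=0.05)
```

For fixed n, the term grows with h, so a richer model must lower the empirical error by more than that increase to improve the bound.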

8 VC dimension: Capacity measure for classification models. Largest number of points that can be shattered by the model. A classifier shatters a set of data points if, for any labeling, the points can be separated. Not the same as the number of parameters in the model!

9 Shattering: Straight lines can shatter 3 points in 2-space. Model: $\mathrm{sign}(\alpha \cdot x)$

10 Shattering: Other model: $\mathrm{sign}(x \cdot x - \alpha)$. Still other model: $\mathrm{sign}(\beta \, x \cdot x - \alpha)$

11 VC dimension: Largest number of points for which there is an arrangement that can be shattered. For straight lines in 2-space, the VC dimension is 3. For hyperplanes in n-space, the VC dimension is n + 1. There is a model with one parameter and infinite VC dimension.
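
Shattering can be checked by brute force for small point sets. The sketch below tries every labeling and searches for a separating line with a plain perceptron, which is an assumption made for simplicity (any linear separability test would do). The point sets and the epoch budget are hypothetical.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron search for (w1, w2, w0) with sign(w . x + w0) equal to each label."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), t in zip(points, labels):
            if t * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:
                # Misclassified (or on the line): move the line toward the point.
                w[0] += t * x1
                w[1] += t * x2
                w[2] += t
                mistakes += 1
        if mistakes == 0:
            return True
    return False  # no separating line found within the epoch budget


def shattered_by_lines(points):
    """True if every +1/-1 labeling of the points was separated by some line."""
    return all(linearly_separable(points, labels)
               for labels in product([+1, -1], repeat=len(points)))


three = [(0, 0), (1, 0), (0, 1)]          # non-collinear points
four = [(0, 0), (1, 0), (0, 1), (1, 1)]   # corners of a square
```

The three points are shattered. The square is not, because the labeling that puts opposite corners in the same class has no separating line. This only rules out one arrangement of four points; the statement that the VC dimension is 3 requires that no arrangement of four points can be shattered.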

12 Structural risk minimization: $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$. Fix the data set and order the classifiers according to their VC dimension. For each classifier, train and calculate the right-hand side. The best classifier minimizes the right-hand side.

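13 Structural risk minimization: $R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h(\log(2n/h) + 1) - \log(\eta/4)}{n}}$. Table (values not recoverable): columns Model, $R_{emp}$, VC confidence, Upper bound; rows $f_1(\alpha)$ to $f_5(\alpha)$, with the row minimizing the upper bound marked best.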
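
The selection procedure can be sketched as follows. All numbers are hypothetical (the empirical errors, VC dimensions, and sample size are invented for illustration and are not the values from the slide's table), and natural logarithms are assumed.

```python
import math


def vc_confidence(h, n, eta=0.05):
    """Square-root term of the bound for VC dimension h and sample size n."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def select_by_srm(candidates, n, eta=0.05):
    """Return (upper bound, model name) for the model minimizing the bound."""
    scored = [(r_emp + vc_confidence(h, n, eta), name)
              for name, r_emp, h in candidates]
    return min(scored)


# Hypothetical (name, empirical error, VC dimension), ordered by VC dimension.
candidates = [
    ("f1", 0.30, 2),
    ("f2", 0.20, 5),
    ("f3", 0.12, 10),
    ("f4", 0.10, 40),
    ("f5", 0.09, 160),
]
bound, best = select_by_srm(candidates, n=2000)
```

The empirical error falls as capacity rises, while the confidence term grows, so the minimum of the sum typically lies at an intermediate model.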

14 Model selection: Cross-validation: use test sets to estimate error. Penalize model complexity: Akaike information criterion (AIC), Bayesian information criterion (BIC). Structural risk minimization.
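
For comparison with the VC bound, the two information criteria can be written in their standard forms, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the sample size, and ln L the maximized log-likelihood. The function names and the example numbers are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fit: log-likelihood -100 with 3 parameters on 100 cases.
aic_value = aic(-100.0, k=3)
bic_value = bic(-100.0, k=3, n=100)
```

Both criteria add a complexity penalty to a goodness-of-fit term; unlike structural risk minimization, the penalty counts parameters rather than VC dimension.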

15 Support Vector Machines: Implement hyperplanes, so the VC dimension is known. Algorithmic representation of concepts from statistical learning theory. Determine hyperplanes that maximize the margin between classes.

16 Geometry of hyperplanes

17 Geometry of hyperplanes: Hyperplanes are invariant to scaling of parameters: $\{x \mid w \cdot x + w_0 = 0\} = \{x \mid c\,w \cdot x + c\,w_0 = 0\}$ for $c \ne 0$. Example: $3x - y - 4 = 0$ and $6x - 2y - 8 = 0$.

18 Geometry of hyperplanes: $3x - y - 4 = 0$ and $6x - 2y - 8 = 0$. At the point $(2, -4)$: $3 \cdot 2 - (-4) - 4 = 6$ and $6 \cdot 2 - 2 \cdot (-4) - 8 = 12$.
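
A short check of the scaling example: both parameterizations describe the same line, the raw function values differ by the scale factor, and dividing by the norm of w gives the same distance. The helper names are hypothetical.

```python
import math


def g(w, w0, x):
    """Value of the linear function w . x + w0 at the point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def signed_distance(w, w0, x):
    """Signed distance of x from the hyperplane w . x + w0 = 0."""
    return g(w, w0, x) / math.sqrt(sum(wi * wi for wi in w))


point = (2, -4)
g1 = g((3, -1), -4, point)                 # 3x - y - 4
g2 = g((6, -2), -8, point)                 # 6x - 2y - 8
d1 = signed_distance((3, -1), -4, point)
d2 = signed_distance((6, -2), -8, point)
```

Because the function value depends on the scaling while the distance does not, the SVM formulation fixes the scale by requiring function values of at least 1 in magnitude on the training points.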

19 Optimal hyperplanes: We want $w \cdot x_i + w_0 \ge +1$ for all $x_i$ in class 1 ($t_i = +1$) and $w \cdot x_i + w_0 \le -1$ for all $x_i$ in class 2 ($t_i = -1$).

20 Optimal hyperplanes: With $g(x) = w \cdot x + w_0$, points on the dashed lines satisfy $g(x) = +1$ resp. $g(o) = -1$. Margin $= |g(x)| / \|w\| + |g(o)| / \|w\| = 2 / \|w\|$. Largest (optimal) margin: maximizing $2 / \|w\|$ is equivalent to minimizing $\|w\|^2$ subject to $t_i (w \cdot x_i + w_0) - 1 \ge 0$.
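
The constraint and the margin formula can be illustrated on a toy data set. The sketch below does not solve the optimization problem; it only compares two hand-picked feasible hyperplanes. The data and both parameter settings are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def margin(w):
    """Width 2 / ||w|| of the band between g = +1 and g = -1."""
    return 2 / math.sqrt(dot(w, w))


def satisfies_constraints(w, w0, X, t):
    """Check t_i (w . x_i + w0) - 1 >= 0 for every training case."""
    return all(ti * (dot(w, xi) + w0) - 1 >= 0 for xi, ti in zip(X, t))


# Hypothetical separable data: class +1 upper right, class -1 lower left.
X = [(2, 2), (3, 3), (0, 0), (-1, 0)]
t = [+1, +1, -1, -1]

w_a, w0_a = (0.5, 0.5), -1.0   # (2, 2) and (0, 0) lie exactly on the margin
w_b, w0_b = (1.0, 1.0), -2.0   # same direction, larger norm
```

Both settings satisfy the constraints, but the one with the smaller norm has the wider margin, which is why minimizing $\|w\|^2$ under the constraints yields the large-margin solution.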

21 Optimal hyperplanes: The optimal hyperplane has the largest margin ("large margin classifiers"). The parameter estimation problem is turned into a constrained optimization problem. Unique solution $w = \sum_i \alpha_i x_i$ over all inputs $x_i$ on the margin ("support vectors"). Decision function $g(x) = \mathrm{sign}(\sum_i \alpha_i \, x_i \cdot x + w_0)$. All other cases $x_j$ are irrelevant to the solution!

22 Nonseparable data sets: Introduce slack variables $\xi_i \ge 0$. Constraints are then $w \cdot x_i + w_0 \ge +1 - \xi_i$ for all $x_i$ in class 1 and $w \cdot x_i + w_0 \le -1 + \xi_i$ for all $x_i$ in class 2. Minimize $\|w\|^2 + C \sum_i \xi_i$.
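
For a fixed hyperplane, the smallest slack that satisfies both constraints is $\xi_i = \max(0, 1 - t_i (w \cdot x_i + w_0))$. The sketch below evaluates the slacks and the objective for one hand-picked hyperplane; it does not perform the minimization. The data, the parameters, and the value of C are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def slacks(w, w0, X, t):
    """Smallest slack per case: max(0, 1 - t_i (w . x_i + w0))."""
    return [max(0.0, 1 - ti * (dot(w, xi) + w0)) for xi, ti in zip(X, t)]


def soft_margin_objective(w, w0, X, t, C):
    """||w||^2 + C * sum of slacks, the quantity to be minimized."""
    return dot(w, w) + C * sum(slacks(w, w0, X, t))


# Hypothetical data: the last case of class -1 sits inside the margin band.
X = [(2, 2), (3, 3), (0, 0), (1, 1)]
t = [+1, +1, -1, -1]
w, w0 = (0.5, 0.5), -1.0

xi = slacks(w, w0, X, t)
objective = soft_margin_objective(w, w0, X, t, C=10)
```

A larger C makes margin violations more expensive, so the optimizer trades a narrower margin for fewer violations.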

23 Nonlinear SVM: Idea: Nonlinearly project the data into a higher-dimensional space with $\Phi: \mathbb{R}^m \to H$. Apply the optimal hyperplane algorithm in $H$.

24 Nonlinear SVM example: Idea: Project $\mathbb{R}^2 \to \mathbb{R}^3$ via $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$

25 Nonlinear SVM example: Do the math: $\Phi(x) \cdot \Phi(y) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x \cdot y)^2$
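
The identity can also be checked numerically. The two test vectors are hypothetical; any pair in the plane should give the same agreement up to rounding.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Map (x1, x2) to (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x = (1.0, 2.0)
y = (3.0, -1.0)
lhs = dot(phi(x), phi(y))   # dot product computed in the 3-dimensional space
rhs = dot(x, y) ** 2        # squared dot product computed in the input space
```

The right-hand side never forms the three-dimensional vectors, which is the computational point of the derivation.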

26 Nonlinear SVM: Recall: Input data $x_i$ enters the calculation only via dot products $x_i \cdot x_j$ or $\Phi(x_i) \cdot \Phi(x_j)$. Kernel trick: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. Advantage: no need to calculate $\Phi$. Advantage: no need to know $H$. What are admissible kernel functions?
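
Replacing the dot product in the decision function by a kernel gives $g(x) = \mathrm{sign}(\sum_i \alpha_i K(x_i, x) + w_0)$. The sketch below follows the slides' notation, in which the sign of each case is assumed to be absorbed into $\alpha_i$. The support vector, coefficient, and offset are hand-picked hypothetical values, not the result of training.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def quadratic_kernel(x, y):
    """K(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def decide(x, support_vectors, alphas, w0, kernel):
    """sign(sum_i alpha_i K(x_i, x) + w0), returned as +1 or -1."""
    s = sum(a * kernel(sv, x) for a, sv in zip(alphas, support_vectors)) + w0
    return 1 if s >= 0 else -1


# Hypothetical parameters: with this choice g(x) = sign(x1^2 - 1),
# a boundary that is not a straight line in the input space.
support_vectors = [(1.0, 0.0)]
alphas = [1.0]
w0 = -1.0

outside = decide((2.0, 0.0), support_vectors, alphas, w0, quadratic_kernel)
inside = decide((0.5, 3.0), support_vectors, alphas, w0, quadratic_kernel)
```

Only kernel evaluations against the support vectors are needed; neither $\Phi$ nor $H$ appears in the code.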

27 Kernel functions: Satisfy Mercer's condition. Most widely used (+ parameters): polynomials (degree), Gaussians (variance). For a given kernel function $K$, the projection $\Phi$ and the projection space $H$ are not unique.
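
A minimal sketch of the two kernel families. The exact forms are assumptions: the polynomial kernel is taken in its homogeneous form $(x \cdot y)^d$ (an inhomogeneous variant $(x \cdot y + 1)^d$ is also common), and the Gaussian kernel as $\exp(-\|x - y\|^2 / (2\sigma^2))$. Function and parameter names are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def polynomial_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel (x . y)^degree."""
    return dot(x, y) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return math.exp(-dot(diff, diff) / (2 * sigma ** 2))


x = (1.0, 2.0)
y = (3.0, -1.0)
k_poly = polynomial_kernel(x, y, degree=2)
k_narrow = gaussian_kernel(x, y, sigma=1.0)
k_wide = gaussian_kernel(x, y, sigma=3.0)
```

A larger $\sigma$ makes distant points look more similar, which gives smoother decision boundaries.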

28 Kernel function example: Data space $\mathbb{R}^2$; $x, y \in \mathbb{R}^2$. $K(x, y) = (x \cdot y)^2$. Possible $H = \mathbb{R}^3$. Possible $\Phi(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$

29 Kernel function example: $K(x, y) = (x \cdot y)^2$
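
To illustrate that $\Phi$ and $H$ are not unique for this kernel, the sketch below uses a different map into $\mathbb{R}^4$. This particular map is an assumption chosen for illustration (the extracted slide does not show which alternative it used); the test vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi4(x):
    """Alternative map (x1, x2) -> (x1^2, x1 x2, x1 x2, x2^2) into R^4."""
    x1, x2 = x
    return (x1 * x1, x1 * x2, x1 * x2, x2 * x2)


x = (1.0, 2.0)
y = (3.0, -1.0)
via_phi4 = dot(phi4(x), phi4(y))   # dot product in the 4-dimensional space
via_kernel = dot(x, y) ** 2        # K(x, y) = (x . y)^2 in the input space
```

The repeated middle coordinate contributes $2 x_1 x_2 y_1 y_2$ to the dot product, the same cross term that the $\sqrt{2}$ factor produces in the three-dimensional map.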

30 SVM examples: Linearly separable, C=100

31 SVM examples: C=100; C=1

32 SVM examples: Linear function; Quadratic polynomial

33 SVM examples: Quadratic polynomial, C=10; Quadratic polynomial, C=100

34 SVM examples: Cubic polynomial; Gaussian, σ = 1

35 SVM examples: Quadratic polynomial; Cubic polynomial

36 SVM examples: Cubic polynomial; Degree 4 polynomial

37 SVM examples: Gaussian, σ = 1; Gaussian, σ = 3

38 Performance comparison: Logistic regression, ANN, SVM. Real-world data set: 1619 lesion images, 107 morphometric features: global (size, shape) and local (color distributions). Use ROC analysis.
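
ROC analysis is commonly summarized by the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A minimal sketch using this pairwise definition follows; the scores and labels are hypothetical and are not taken from the lesion data set.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    pos = [s for s, l in zip(scores, labels) if l == +1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    if not pos or not neg:
        raise ValueError("need at least one case from each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores with class labels in {+1, -1}.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [+1, +1, -1, +1, -1, -1]
area = auc(scores, labels)
```

The pairwise form is quadratic in the number of cases, which is acceptable for a sketch; rank-based formulas give the same value more efficiently.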

39 CN DN + MM Logistic regression: Artificial neural networks: Support vector machines (polynomial kernel): to Support vector machines (Gaussian kernel): to 0.831

40 MM CN + DN Logistic regression: Artificial neural networks: Support vector machines (polynomial kernel): to Support vector machines (Gaussian kernel): to 0.970

41 Summary: SVM based on statistical learning theory. Bounds on generalization performance. Optimal separating hyperplanes. Kernel trick (projection). Performs comparably to logistic regression and neural networks.

42 Pointers to the literature: Burges C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 1998; 2(2). Cristianini N, Shawe-Taylor J. An introduction to support vector machines. Cambridge University Press. Vapnik V. Statistical learning theory. Wiley Interscience, 1998.
