Lecture 7: Support Vector Machines
Isabelle Guyon
guyoni@inf.ethz.ch

References
- A training algorithm for optimal margin classifiers. Boser-Guyon-Vapnik, COLT, 1992. http://www.clopinet.com/isabelle/papers/colt92.ps.z
- Book chapters 1 and 2.
Software: LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Perceptron Learning Rule / Optimum Margin Perceptron
- Perceptron learning rule: if x_i is misclassified, update w_j <- w_j + y_i x_ij. Converges to one solution classifying all the examples well: w_j = Σ_i α_i y_i x_ij.
- Optimum margin perceptron: if x_i is the least well classified example, update w_j <- w_j + y_i x_ij. Converges to the optimum margin perceptron (Mezard-Krauth, 1988): w_j = Σ_i α_i y_i x_ij. QP formulation (Vapnik, 1962).
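To make the two update rules concrete, here is a minimal NumPy sketch (mine, not part of the original lecture; the loop counts and stopping criteria are arbitrary assumptions): the perceptron updates on any misclassified example, while the optimum-margin ("minover"-style) variant always updates on the currently least well classified example.

    import numpy as np

    def perceptron(X, y, epochs=100):
        # Perceptron learning rule: w_j <- w_j + y_i * x_ij whenever x_i is misclassified.
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(len(y)):
                if y[i] * (X[i] @ w) <= 0:      # x_i is misclassified
                    w += y[i] * X[i]
        return w

    def minover(X, y, iters=1000):
        # Optimum margin variant: always update on the least well classified
        # example, i.e. the one with the smallest margin y_i * (w . x_i).
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            i = np.argmin(y * (X @ w))
            w += y[i] * X[i]
        return w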
Optimum Margin Solution
- Unique solution; depends only on the support vectors (SV): w_j = Σ_{i∈SV} α_i y_i x_ij. SVs are the examples closest to the boundary.
- Bound on leave-one-out error: LOO ≤ n_SV / m.
- Most stable, good from an MDL point of view.
- But: sensitive to outliers, and works only in the linearly separable case.

Negative Margin
- Multiple negative-margin solutions, which all give the same number of errors.

Soft-Margin Non-Linear Perceptron (Rosenblatt, 1957)
- Map inputs to features, x -> Φ(x) = (φ_1(x), ..., φ_N(x)), and classify with f(x) = w·Φ(x) + b.
- Examples within the margin area incur a penalty and become non-marginal support vectors.
- Unique solution again (Cortes-Vapnik, 1995): w_j = Σ_{i∈SV} α_i y_i x_ij.
[Figure: one-layer network with inputs x_1, ..., x_n, feature units φ_1(x), ..., φ_N(x), weights w_1, ..., w_N, bias b, and output f(x).]
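To see why a linear decision function in feature space is non-linear in the inputs, here is a small sketch (my illustration, not from the slides) with an explicit degree-2 polynomial map Φ, so that f(x) = w·Φ(x) + b draws a quadratic boundary in the original 2-D input space:

    import numpy as np

    def phi(x):
        # Explicit degree-2 polynomial feature map for 2-D input x = (x1, x2).
        x1, x2 = x
        return np.array([x1, x2, x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

    # Arbitrary weights chosen for illustration: f(x) = x1^2 + x2^2 - 1,
    # i.e. the decision boundary f(x) = 0 is the unit circle.
    w = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
    b = -1.0
    f = lambda x: w @ phi(x) + b
    print(f(np.array([0.5, 0.5])))   # -0.5: inside the circle
    print(f(np.array([2.0, 0.0])))   #  3.0: outside the circle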
Kernel Trick (Aizerman-Braverman-Rozonoer, 1964)
- Primal form: f(x) = w·Φ(x), with w = Σ_i α_i y_i Φ(x_i).
- Dual form: f(x) = Σ_i α_i y_i k(x_i, x), with k(x_i, x) = Φ(x_i)·Φ(x).

Kernel Method (potential functions, Aizerman et al., 1964)
- f(x) = Σ_i α_i k(x_i, x) + b, where k(·,·) is a similarity measure or kernel.
[Figure: network with units k(x_1, x), ..., k(x_m, x), weights α_1, ..., α_m, bias b, and output f(x).]

Some Kernels (reminder)
A kernel is a dot product in some feature space: k(s, t) = Φ(s)·Φ(t). Examples:
- k(s, t) = s·t                               linear kernel
- k(s, t) = exp(-γ ||s - t||^2)               Gaussian kernel
- k(s, t) = exp(-γ ||s - t||)                 exponential kernel
- k(s, t) = (1 + s·t)^q                       polynomial kernel
- k(s, t) = (1 + s·t)^q exp(-γ ||s - t||^2)   hybrid kernel

Support Vector Classifier (Boser-Guyon-Vapnik, 1992)
- f(x) = Σ_{j∈SV} α_j y_j k(x, x_j), with x = [x_1, x_2].
[Figure: decision boundary f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other; support vectors lie on the margin.]
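A rough sketch of the kernels and the dual-form decision function above (my own code; γ, q and the variable names are illustrative assumptions):

    import numpy as np

    def linear(s, t):             return s @ t
    def gaussian(s, t, gamma=1.): return np.exp(-gamma * np.sum((s - t) ** 2))
    def polynomial(s, t, q=2):    return (1 + s @ t) ** q

    def decision(x, alphas, ys, Xsv, b, kernel=gaussian):
        # Dual-form SVC: f(x) = sum_{i in SV} alpha_i * y_i * k(x_i, x) + b.
        # Only the support vectors Xsv (with their alphas and labels ys) are needed.
        return sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, Xsv)) + b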
Margin and ||w||
- Maximizing the margin is equivalent to minimizing ||w||.

Quadratic Programming
- Hard margin: min ||w||^2 s.t. y_j (w·x_j + b) ≥ 1 for all examples.
- Soft margin: min ||w||^2 + C Σ_j ξ_j^β s.t. y_j (w·x_j + b) ≥ 1 - ξ_j, ξ_j ≥ 0, β = 1 or 2.
- Non-linear case: x -> Φ(x); min ||w||^2 + C Σ_j ξ_j^β s.t. y_j (w·Φ(x_j) + b) ≥ 1 - ξ_j.

Dual Formulation (Ridge SVC)
max_α  -½ α^T K α + α^T 1
s.t.   α^T y = 0;  0 ≤ α_i ≤ C
with K = [ y_i y_j k(x_i, x_j) + (1/C) δ_ij ].

Slack variables and support vectors:
- 1 - y_j f(x_j) < 0: ξ_j = 0, no penalty, not a SV.
- 1 - y_j f(x_j) = 0: ξ_j = 0, no penalty, marginal SV.
- 1 - y_j f(x_j) = ξ_j > 0: penalty ξ_j, non-marginal SV.

Loss: L(x_j) = max(0, 1 - y_j f(x_j))^β. Risk: Σ_j L(x_j).
Regularized risk: min (1/C) ||w||^2 + Σ_j L(x_j).
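The regularized-risk form min (1/C)||w||^2 + Σ_j L(x_j) suggests a direct optimization route; the sketch below (my own, not the lecture's QP solver) minimizes it by subgradient descent for a linear model with the β = 1 hinge loss:

    import numpy as np

    def linear_svc(X, y, C=1.0, lr=0.01, epochs=200):
        # Subgradient descent on (1/C) * ||w||^2 + sum_j max(0, 1 - y_j f(x_j)),
        # with f(x) = w . x + b (beta = 1 hinge loss).
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            margins = y * (X @ w + b)
            active = margins < 1                # examples inside the margin
            grad_w = (2.0 / C) * w - (y[active, None] * X[active]).sum(axis=0)
            grad_b = -y[active].sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b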
Ridge Regression (reminder)
- Sum of squares: R = Σ_i (f(x_i) - y_i)^2; in the classification case (y_i = ±1): R = Σ_i (1 - y_i f(x_i))^2.
- Add a regularizer: R = Σ_i (1 - y_i f(x_i))^2 + λ ||w||^2.
- Compare with SVC: R = Σ_j max(0, 1 - y_j f(x_j))^β + λ ||w||^2.

Structural Risk Minimization (Vapnik, 1984)
- Nested subsets of models of increasing complexity/capacity: S_1 ⊂ S_2 ⊂ ... ⊂ S_N.
- Example, with ||w||^2: S_k = { w : ||w||^2 < A_k }, A_1 < A_2 < ... < A_k.
- Minimization under a capacity constraint: min R_emp[f] s.t. ||w||^2 < A_k.
- Lagrangian: R_reg[f] = R_emp[f] + λ ||w||^2.
- Radius-margin bound: LOO_cv ≤ 4 r^2 ||w||^2 (Vapnik-Chapelle, 2000).

Loss Functions
[Figure: losses L(y, f(x)) plotted against z = y f(x); z < 0 means misclassified, the margin lies around z = 1.]
- Adaboost loss: e^{-z}
- Logistic loss: log(1 + e^{-z})
- Perceptron loss: max(0, -z)
- SVC loss, β = 1 (hinge): max(0, 1 - z)
- SVC loss, β = 2: max(0, 1 - z)^2
- Square loss: (1 - z)^2
- 0/1 loss

Regularizers
- ||w||_2^2 = Σ_i w_i^2: 2-norm regularization (ridge regression, original SVM).
- ||w||_1 = Σ_i |w_i|: 1-norm regularization (Lasso, Tibshirani 1996; 1-norm SVM, 1965).
- ||w||_0 = number of nonzero weights: 0-norm (Weston et al., 2003).
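All of these losses are functions of the single margin variable z = y f(x); a short sketch (my illustration) tabulates them side by side:

    import numpy as np

    z = np.linspace(-2, 2, 9)                       # margins z = y * f(x)
    losses = {
        "adaboost":   np.exp(-z),
        "logistic":   np.log(1 + np.exp(-z)),
        "perceptron": np.maximum(0, -z),
        "SVC beta=1": np.maximum(0, 1 - z),
        "SVC beta=2": np.maximum(0, 1 - z) ** 2,
        "square":     (1 - z) ** 2,
        "0/1":        (z < 0).astype(float),
    }
    for name, values in losses.items():
        print(f"{name:12s}", np.round(values, 2))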
Regression SVM
- Epsilon-insensitive loss: no penalty as long as |y_i - f(x_i)| ≤ ε.

Unsupervised Learning
SVMs for:
- density estimation: fit F(x) (Vapnik, 1998);
- finding the support of a distribution (one-class SVM) (Schoelkopf et al., 1999);
- novelty detection (Schoelkopf et al., 1999);
- clustering (Ben-Hur, 2001).

Summary
For statistical model inference, two ingredients are needed:
- A loss function: defines the residual error, i.e. what is not explained by the model; characterizes the data uncertainty or noise.
- A regularizer: defines our prior knowledge and biases the solution; characterizes our uncertainty about the model. We usually bet on simpler solutions (Ockham's razor).

Exercise Class
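As a quick illustration of the ε-insensitive loss (my example, using scikit-learn's SVR, whose epsilon parameter sets the width of the penalty-free tube):

    import numpy as np
    from sklearn.svm import SVR

    # Noisy 1-D regression problem.
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 5, 60))[:, None]
    y = np.sin(X).ravel() + 0.1 * rng.randn(60)

    # Points predicted within +/- epsilon of their target incur no loss;
    # only points on or outside the tube become support vectors.
    model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
    print("support vectors:", len(model.support_), "of", len(y))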
Homework 7 (Filters: see Chapter 3)
1) Download the software for homework 7.
2) Taking inspiration from the examples, write a new feature ranking filter object. Choose one from Chapter 3 or invent your own.
3) Provide the pvalue and FDR (using a tabulated distribution or the probe method).
4) Email a zip file with your object and a plot of the FDR to guyoni@inf.ethz.ch with subject "Homework7", no later than: Tuesday December 13th.

Dexter (filter texts)
Size: 0.9 MB   Type: sparse integer   Features: 20,000
Training examples: 300   Validation examples: 300   Test examples: 2,000
Best entries: BER ~ 3.3-3.9%, AUC ~ 0.97-0.99, Frac_feat ~ 1.5%, Frac_probe ~ 5%

Baseline model for Dexter:
my_classif=svc({'coef1=1', 'degree=1', 'gamma=0', 'shrinkage=0.5'});
my_model=chain({s2n('f_max=300'), normalize, my_classif})
Results: feat 1.5%, probe 6.33%; BER: train 0.33%, valid 7%, test 5%; AUC: train 1.00, valid 0.982, test 0.988.
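The baseline above chains an s2n (signal-to-noise) filter in front of the classifier; here is a rough Python sketch of what such a feature ranking filter computes (my reconstruction for illustration, not the CLOP implementation):

    import numpy as np

    def s2n_ranking(X, y):
        # Signal-to-noise ratio per feature j, for binary labels y in {-1, +1}:
        # score_j = |mu+_j - mu-_j| / (sigma+_j + sigma-_j).
        pos, neg = X[y == 1], X[y == -1]
        score = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0) + 1e-12)
        return np.argsort(-score), score        # feature indices, best first

    # Keeping the top f_max features, as s2n('f_max=...') does in the chain:
    # idx, _ = s2n_ranking(X_train, y_train); X_red = X_train[:, idx[:f_max]]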
Evaluation of pval and FDR
- Ttest object: computes pval analytically; FDR ~ pval · n / n_sc.
- probe object: takes any feature ranking object as an argument (e.g. s2n, relief, Ttest); pval ~ n_sp / n_p; FDR ~ pval · n / n_sc.
[Figure: analytic vs. probe FDR curves (red: analytic, blue: probe) on Arcene, Gisette, Dexter, Madelon and Dorothea.]
[Figure: Relief vs. Ttest -- FDR (fraction of features falsely found significant) vs. rank (number of features found significant); dashed blue: Ttest with probes.]
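A minimal sketch of the probe method (my own illustration, reusing a ranker like the s2n sketch above): append random probe features that are irrelevant by construction, rank real features and probes together, and estimate pval ~ n_sp/n_p and FDR ~ pval·n/n_sc at each rank:

    import numpy as np

    def probe_fdr(X, y, ranker, n_probes=1000, seed=0):
        # pval ~ n_sp / n_p  (fraction of probes ranked above the threshold)
        # FDR  ~ pval * n / n_sc  (n real features, n_sc selected candidates)
        rng = np.random.RandomState(seed)
        n_examples, n = X.shape
        probes = rng.randn(n_examples, n_probes)    # irrelevant by construction
        _, scores = ranker(np.hstack([X, probes]), y)
        order = np.argsort(-scores)                 # best-ranked first
        is_probe = order >= n
        n_sp = np.cumsum(is_probe)                  # probes selected so far
        n_sc = np.cumsum(~is_probe)                 # real features selected so far
        pval = n_sp / n_probes
        fdr = np.minimum(1.0, pval * n / np.maximum(n_sc, 1))
        return pval[~is_probe], fdr[~is_probe]      # one value per real feature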