ν = 0.1: at most 10% of the training set can be margin errors; ν = 0.8: at most 80% of the training set can be margin errors


Evelyn Sherman
5 years ago

ν-SVMs

In the traditional soft-margin classification SVM formulation we have a penalty constant $C$ such that $1/C \propto$ size of margin. Furthermore, there is no a priori guidance as to what $C$ should be set to; the default is a value of 1. However, the precise value needs to be determined experimentally.

ν-SVMs

Schölkopf et al. [a] suggest an alternative formulation of soft-margin SVMs based on a parameter $\nu \in [0, 1]$. The advantages of the ν formulation are that it represents an upper bound on the fraction of margin errors allowed,

ν = 0.1: at most 10% of the training set can be margin errors
ν = 0.8: at most 80% of the training set can be margin errors

and that it is proportional to the size of the margin, $\nu \propto$ size of margin. This implies that determining a value for ν is a more intuitive process than finding a value for the penalty constant $C$.

[a] B. Schölkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New Support Vector Algorithms. Neural Computation, 12, 2000.
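The bound can be turned into a quick piece of arithmetic. The sketch below is only an illustration: the function name `max_margin_errors` and the training-set size of 200 are assumptions, not part of the slides.

```python
import math


def max_margin_errors(nu: float, n_train: int) -> int:
    """Largest number of margin errors allowed by the bound nu * n_train.

    Hypothetical helper: nu is the nu-SVM parameter in [0, 1] and
    n_train is the number of training points.
    """
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [0, 1]")
    # The small offset guards against floating-point round-off in nu * n_train.
    return math.floor(nu * n_train + 1e-9)


# With 200 training points (an assumed size):
print(max_margin_errors(0.1, 200))  # at most 10% of 200 points
print(max_margin_errors(0.8, 200))  # at most 80% of 200 points
```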

ν-SVC

We can formulate the ν-SVC [a] problem in the primal version as follows,

$$\min_{w,\xi,\rho,b} \phi(w, \xi, \rho) = \frac{1}{2}\, w \cdot w - \nu\rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i$$

subject to

$$y_i (w \cdot x_i - b) \ge \rho - \xi_i, \qquad \xi_i \ge 0, \qquad \rho \ge 0, \qquad i = 1, \dots, l,$$

where $l$ is the number of training points. Here $\xi$ represents the set of slack variables as before.

Observations:

We no longer have a constant margin of value 1; instead we consider the size of the margin an explicit optimization variable, $\rho$.

Observe that if $\xi = 0$ then the margin is $2\rho / \lVert w \rVert$.

We do not weight the margin errors with a penalty constant; instead the term $-\nu\rho$ controls how strongly the size of the margin enters the objective.

[a] ν-SVC means ν support vector classification.
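As a rough illustration of the primal problem, the following sketch evaluates the objective for a fixed candidate $(w, b, \rho)$ by setting each slack to the smallest value that satisfies its constraint. The function name `nu_svc_primal` and the toy data are assumptions made for this example; it evaluates the objective and does not solve the optimization problem.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def nu_svc_primal(w, b, rho, X, y, nu):
    """Return (objective, slacks) of the nu-SVC primal for a fixed (w, b, rho).

    Each slack is the smallest xi_i >= 0 with y_i (w . x_i - b) >= rho - xi_i.
    """
    l = len(X)
    slacks = [max(0.0, rho - yi * (dot(w, xi) - b)) for xi, yi in zip(X, y)]
    objective = 0.5 * dot(w, w) - nu * rho + sum(slacks) / l
    return objective, slacks


# Toy 1-D data (assumed): the third point lies on the wrong side of the margin.
X = [(2.0,), (-2.0,), (0.5,)]
y = [1, -1, -1]
objective, slacks = nu_svc_primal(w=(1.0,), b=0.0, rho=1.0, X=X, y=y, nu=0.5)
print(objective, slacks)
```

Only the third point has a non-zero slack, so one of three points is a margin error, which is within the bound ν = 0.5.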

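Dual ν-SVC

The Lagrangian,

$$L(\alpha, \beta, \delta, w, \xi, \rho, b) = \frac{1}{2}\, w \cdot w - \nu\rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \big( y_i (w \cdot x_i - b) - \rho + \xi_i \big) - \sum_{i=1}^{l} \beta_i \xi_i - \delta\rho$$

with $\alpha_i, \beta_i, \delta \ge 0$, where the optimization problem is

$$\max_{\alpha,\beta,\delta} \; \min_{w,\xi,\rho,b} L(\alpha, \beta, \delta, w, \xi, \rho, b).$$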

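KKT Conditions

A solution $\alpha^*, \beta^*, \delta^*, w^*, \xi^*, b^*$, and $\rho^*$ has to satisfy the KKT conditions. With all partial derivatives evaluated at the solution,

$$\frac{\partial L}{\partial w} = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0, \qquad \frac{\partial L}{\partial \rho} = 0, \qquad \frac{\partial L}{\partial b} = 0,$$

$$\alpha_i^* \big( y_i (w^* \cdot x_i - b^*) + \xi_i^* - \rho^* \big) = 0, \qquad \beta_i^* \xi_i^* = 0, \qquad \delta^* \rho^* = 0,$$

$$y_i (w^* \cdot x_i - b^*) + \xi_i^* - \rho^* \ge 0, \qquad \xi_i^* \ge 0, \qquad \rho^* \ge 0,$$

$$\alpha_i^* \ge 0, \qquad \beta_i^* \ge 0, \qquad \delta^* \ge 0,$$

for $i = 1, \dots, l$.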

Dual ν-SVC

Taking the partial derivatives of $L(\alpha, \beta, \delta, w, \xi, \rho, b)$ with respect to the primal variables and setting them to 0 we obtain,

$$w = \sum_{i=1}^{l} \alpha_i y_i x_i, \qquad \alpha_i + \beta_i = \frac{1}{l}, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0, \qquad \sum_{i=1}^{l} \alpha_i = \nu + \delta.$$

Plugging these back into the Lagrangian gives us our dual optimization problem.
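The substitution can be written out explicitly. The following is a sketch of the algebra, assuming the sign convention $w \cdot x_i - b$ for the decision function and the multipliers $\alpha_i$, $\beta_i$, $\delta$ of the Lagrangian. Expanding the Lagrangian and grouping terms by primal variable, every group except the one in $w$ vanishes under the four stationarity conditions:

```latex
\begin{aligned}
L &= \tfrac{1}{2}\, w \cdot w
     - \sum_{i=1}^{l} \alpha_i y_i \,(w \cdot x_i)
     + b \sum_{i=1}^{l} \alpha_i y_i
     + \rho \Big( \sum_{i=1}^{l} \alpha_i - \nu - \delta \Big)
     + \sum_{i=1}^{l} \xi_i \Big( \tfrac{1}{l} - \alpha_i - \beta_i \Big) \\
  &= \tfrac{1}{2}\, w \cdot w - w \cdot w + 0 + 0 + 0 \\
  &= -\tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
       \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
\end{aligned}
```

The multipliers $\beta_i$ and $\delta$ no longer appear in the objective. They survive only through their sign constraints: $\beta_i \ge 0$ gives $\alpha_i \le 1/l$, and $\delta \ge 0$ gives $\sum_i \alpha_i \ge \nu$.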

Dual ν-SVC

This gives us a training algorithm for soft-margin ν-SVC with the kernel $k(x_i, x_j)$ substituted for the dot product in input space,

$$\max_{\alpha} \phi'(\alpha) = \max_{\alpha} \; -\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j k(x_i, x_j)$$

subject to the constraints,

$$\sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \sum_{i=1}^{l} \alpha_i \ge \nu, \qquad \frac{1}{l} \ge \alpha_i \ge 0, \quad i = 1, \dots, l.$$

Compared to the dual optimization problem of C-SVCs we have two differences: (a) we lost the term $\sum_i \alpha_i$ in the objective function and (b) we have an additional constraint due to $\rho$.
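To make the dual concrete, here is a minimal sketch that evaluates the dual objective and checks the three constraints for a candidate vector of multipliers. The function names, the tolerance, and the two-point kernel matrix are assumptions for illustration; a real trainer would hand this problem to a quadratic programming solver.

```python
def nu_svc_dual_objective(alpha, y, K):
    """Dual objective: -1/2 * sum_ij y_i y_j alpha_i alpha_j K[i][j]."""
    l = len(alpha)
    total = sum(
        y[i] * y[j] * alpha[i] * alpha[j] * K[i][j]
        for i in range(l)
        for j in range(l)
    )
    return -0.5 * total


def is_dual_feasible(alpha, y, nu, tol=1e-9):
    """Check sum_i y_i alpha_i = 0, sum_i alpha_i >= nu, and 0 <= alpha_i <= 1/l."""
    l = len(alpha)
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    enough_mass = sum(alpha) >= nu - tol
    in_box = all(-tol <= a <= 1.0 / l + tol for a in alpha)
    return balanced and enough_mass and in_box


# Two 1-D points x_1 = 1, x_2 = -1 with a linear kernel (assumed toy data).
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
alpha = [0.25, 0.25]
print(is_dual_feasible(alpha, y, nu=0.5))
print(nu_svc_dual_objective(alpha, y, K))
```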

Dual ν-SVC

It turns out that our decision function stays the same as in the C-classifiers,

$$\hat{f}(x) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i^* y_i k(x_i, x) - b^* \right).$$

Here, as before, $b^*$ can be computed from support vectors that are not bound, $0 < \alpha_i^* < 1/l$.
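A minimal sketch of this decision function, assuming the multipliers and offset have already been found by some solver. The names `decide` and `linear_kernel`, the toy support vectors, and the choice to map a score of exactly zero to +1 are assumptions made for the example.

```python
def linear_kernel(u, v):
    """Dot product in input space; any other kernel function can be passed instead."""
    return sum(a * b for a, b in zip(u, v))


def decide(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Return sign(sum_i alpha_i y_i k(x_i, x) - b) as +1 or -1."""
    score = sum(
        a * yi * kernel(xi, x)
        for a, yi, xi in zip(alphas, labels, support_vectors)
    ) - b
    # Assumption: a score of exactly zero is assigned to the positive class.
    return 1 if score >= 0 else -1


# Two 1-D support vectors (assumed toy values).
support_vectors = [(1.0,), (-1.0,)]
labels = [1, -1]
alphas = [0.25, 0.25]
print(decide((2.0,), support_vectors, labels, alphas, b=0.0))
print(decide((-3.0,), support_vectors, labels, alphas, b=0.0))
```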

ν-SVC

[Two figure slides, not recovered here. Source: "Learning with Kernels", Schölkopf and Smola, MIT, 2002]

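ν-SVR

In ν-SVR we want to have our $\varepsilon$ automatically computed. This gives rise to the following primal optimization problem,

$$\min_{w,\xi,\xi^*,\varepsilon,b} \phi(w, \xi, \xi^*, \varepsilon, b) = \frac{1}{2}\, w \cdot w + C \left( \nu\varepsilon + \frac{1}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^*) \right)$$

subject to

$$(w \cdot x_i - b) - y_i \le \varepsilon + \xi_i, \qquad y_i - (w \cdot x_i - b) \le \varepsilon + \xi_i^*, \qquad \xi_i \ge 0, \qquad \xi_i^* \ge 0, \qquad \varepsilon \ge 0.$$

Notice that here the term $\nu\varepsilon$ determines how much the size of the $\varepsilon$-tube contributes to the optimization problem.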
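The role of the tube width can be seen by evaluating the objective for a fixed candidate. The sketch below assumes the number of training points is written $l$ and sets each slack to the smallest value satisfying its constraint; the function name `nu_svr_primal` and the toy data are illustrative assumptions, and nothing is optimized.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def nu_svr_primal(w, b, eps, X, y, nu, C):
    """Return (objective, xi, xi_star) of the nu-SVR primal for a fixed (w, b, eps)."""
    l = len(X)
    preds = [dot(w, xi) - b for xi in X]
    # xi_i: prediction overshoots the target by more than eps.
    xi = [max(0.0, p - yi - eps) for p, yi in zip(preds, y)]
    # xi_i^*: prediction undershoots the target by more than eps.
    xi_star = [max(0.0, yi - p - eps) for p, yi in zip(preds, y)]
    objective = 0.5 * dot(w, w) + C * (nu * eps + (sum(xi) + sum(xi_star)) / l)
    return objective, xi, xi_star


# Toy 1-D regression data (assumed): only the second point leaves the tube.
X = [(1.0,), (2.0,), (3.0,)]
y = [1.0, 2.5, 3.0]
print(nu_svr_primal(w=(1.0,), b=0.0, eps=0.25, X=X, y=y, nu=0.5, C=2.0))
```

Widening the tube shrinks the slack terms but increases the $\nu\varepsilon$ term, which is the trade-off that ν controls.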

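Dual ν-SVR

This gives rise to the dual,

$$\max_{\alpha,\alpha^*} \phi'(\alpha, \alpha^*) = \max_{\alpha,\alpha^*} \; \sum_{i=1}^{l} (\alpha_i^* - \alpha_i)\, y_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\, k(x_i, x_j)$$

subject to the constraints,

$$\sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0, \qquad \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) \le C\nu, \qquad \frac{C}{l} \ge \alpha_i, \alpha_i^* \ge 0, \quad i = 1, \dots, l.$$

Our model is,

$$\hat{f}(x) = \sum_{i=1}^{l} (\alpha_i^* - \alpha_i)\, k(x_i, x) - b^*.$$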
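As an illustration, the sketch below checks the dual constraints and evaluates the regression model for given multipliers. It assumes that $\alpha_i^*$ belongs to the constraint bounding $y_i - (w \cdot x_i - b)$, so the model coefficient of point $i$ is $\alpha_i^* - \alpha_i$. All names and toy values are hypothetical.

```python
def linear_kernel(u, v):
    """Dot product in input space; any other kernel function can be passed instead."""
    return sum(a * b for a, b in zip(u, v))


def is_svr_dual_feasible(alpha, alpha_star, nu, C, tol=1e-9):
    """Check the three nu-SVR dual constraints for a candidate (alpha, alpha_star)."""
    l = len(alpha)
    balanced = abs(sum(a - s for a, s in zip(alpha, alpha_star))) <= tol
    bounded_mass = sum(a + s for a, s in zip(alpha, alpha_star)) <= C * nu + tol
    in_box = all(-tol <= v <= C / l + tol for v in list(alpha) + list(alpha_star))
    return balanced and bounded_mass and in_box


def nu_svr_predict(x, X, alpha, alpha_star, b, kernel=linear_kernel):
    """Evaluate f(x) = sum_i (alpha_star_i - alpha_i) k(x_i, x) - b."""
    return sum(
        (s - a) * kernel(xi, x) for a, s, xi in zip(alpha, alpha_star, X)
    ) - b


# Two 1-D training points (assumed toy values) that reproduce f(x) = x.
X = [(1.0,), (-1.0,)]
alpha = [0.0, 0.5]
alpha_star = [0.5, 0.0]
print(is_svr_dual_feasible(alpha, alpha_star, nu=0.5, C=4.0))
print(nu_svr_predict((2.0,), X, alpha, alpha_star, b=0.0))
```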

ν-SVR

[Figure slide, not recovered here.]
