Approximation Theoretical Questions for SVMs


Slide 1: Ingo Steinwart, LA-UR, October 20, 2007.

Slide 2: Informal Description of the Learning Goal. X is the space of input samples and Y is the space of labels, usually Y ⊂ R. Already observed samples: T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n. Goal: with the help of T, find a function f : X → R which predicts the label y for new, unseen x.

Slide 3: Illustration: Binary Classification. Problem: the set X is divided into two unknown classes X_{−1} and X_1. Goal: approximately recover the classes X_{−1} and X_1. [Figure] Left: negative (blue) and positive (red) samples. Right: behaviour of a decision function (green) f : X → Y, with regions {f = −1} and {f = 1}.

Slide 4: Formal Definition of Statistical Learning. Basic assumptions: P is an unknown probability measure on X × Y; T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n is sampled from P^n; future pairs (x, y) will also be sampled from P; L : Y × R → [0, ∞] is a loss function that measures the cost L(y, t) of predicting y by t. Goal: find a function f_T : X → R with small risk
R_{L,P}(f_T) := ∫_{X×Y} L(y, f_T(x)) dP(x, y).
Interpretation: the average future cost of predicting by f_T should be small.
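
As an added illustration (not from the slides), the empirical counterpart of this risk on a sample T is simply the average loss; a minimal Python sketch, with the least squares loss chosen only as an example:

```python
import numpy as np

def least_squares_loss(y, t):
    """L(y, t) = (y - t)^2."""
    return (y - t) ** 2

def empirical_risk(loss, f, X, y):
    """Average loss (1/n) * sum_i L(y_i, f(x_i)) of a predictor f on the sample T."""
    return float(np.mean([loss(y_i, f(x_i)) for x_i, y_i in zip(X, y)]))

# toy usage: the constant predictor f(x) = 0 on a random sample
X = np.random.randn(100, 2)
y = np.sign(X[:, 0])
print(empirical_risk(least_squares_loss, lambda x: 0.0, X, y))
```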

Slide 5: Questions in Statistical Learning I. Bayes risk: R*_{L,P} := inf{ R_{L,P}(f) : f : X → R measurable }; a function attaining this infimum is denoted by f*_{L,P}. Learning method: assigns to every training set T a predictor f_T : X → R. Consistency: a learning method is called universally consistent if
R_{L,P}(f_T) → R*_{L,P} in probability   (1)
for n → ∞ and every probability measure P on X × Y. Good news: many learning methods are universally consistent. First result: Stone (1977), Annals of Statistics.

Slide 6: Questions in Statistical Learning II. Rates: does there exist a learning method and a convergence rate a_n → 0 such that
E_{T∼P^n} R_{L,P}(f_T) − R*_{L,P} ≤ C_P a_n, n ≥ 1,
for every probability measure P on X × Y? Bad news (Devroye, 1982, IEEE TPAMI): no! (if |Y| ≥ 2, |X| = ∞, and L is non-trivial). Good news: yes, if one makes some mild (?!) assumptions on P. Too many results in this direction to mention them.

Slide 7: Reproducing Kernel Hilbert Spaces I. k : X × X → R is a kernel if there exist a Hilbert space H and a map Φ : X → H with k(x, x') = ⟨Φ(x), Φ(x')⟩ for all x, x' ∈ X. Equivalently, all Gram matrices (k(x_i, x_j))_{i,j=1}^n are symmetric and positive semi-definite. RKHS of k: the smallest such H that consists of functions. Construction: take the completion of
{ Σ_{i=1}^n α_i k(x_i, ·) : n ∈ N, α_1, ..., α_n ∈ R, x_1, ..., x_n ∈ X }
equipped with the dot product
⟨ Σ_{i=1}^n α_i k(x_i, ·), Σ_{j=1}^m β_j k(x̂_j, ·) ⟩ := Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, x̂_j).
Feature map: Φ : x ↦ k(x, ·).
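
A minimal numerical sketch of this dot product (added here, not in the slides): for finite expansions it reduces to a bilinear form of the cross Gram matrix, α^T K β. The Gaussian kernel from the next slide is used only as a concrete choice of k:

```python
import numpy as np

def gauss_kernel(x, xp, sigma=1.0):
    """Gaussian RBF kernel k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2)."""
    return np.exp(-sigma**2 * np.sum((x - xp) ** 2))

def pre_rkhs_inner(alpha, Xa, beta, Xb, kernel=gauss_kernel):
    """<sum_i alpha_i k(x_i, .), sum_j beta_j k(xhat_j, .)> = alpha^T K beta,
    where K_ij = k(x_i, xhat_j)."""
    K = np.array([[kernel(xi, xj) for xj in Xb] for xi in Xa])
    return float(alpha @ K @ beta)

alpha, Xa = np.array([1.0, -0.5]), np.random.randn(2, 3)
beta, Xb = np.array([0.3, 0.7, -1.0]), np.random.randn(3, 3)
print(pre_rkhs_inner(alpha, Xa, beta, Xb))
```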

Slide 8: Reproducing Kernel Hilbert Spaces II. Polynomial kernels: for a ≥ 0 and m ∈ N let k(x, x') := (⟨x, x'⟩ + a)^m, x, x' ∈ R^d. Gaussian RBF kernels: for σ > 0 let k_σ(x, x') := exp(−σ² ‖x − x'‖₂²), x, x' ∈ R^d; the parameter 1/σ is called the width. Denseness of Gaussian RKHSs: the RKHS H_σ of k_σ is dense in L_p(µ) for all p ∈ [1, ∞) and all probability measures µ on R^d.
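
As a hedged illustration (added, not part of the slides), one can check numerically that the polynomial kernel with m = 2 and a = 1 agrees with an explicit finite-dimensional feature map, which is one way to see that it is indeed a kernel:

```python
import numpy as np

def poly_kernel(x, xp, a=1.0, m=2):
    """k(x, x') = (<x, x'> + a)^m."""
    return (x @ xp + a) ** m

def phi_deg2(x, a=1.0):
    """Explicit feature map for (<x, x'> + a)^2 on R^d:
    all monomials x_i * x_j, the rescaled coordinates sqrt(2a) * x_i, and the constant a."""
    d = len(x)
    quad = np.array([x[i] * x[j] for i in range(d) for j in range(d)])
    lin = np.sqrt(2 * a) * x
    return np.concatenate([quad, lin, [a]])

x, xp = np.random.randn(4), np.random.randn(4)
print(poly_kernel(x, xp), phi_deg2(x) @ phi_deg2(xp))  # the two values agree
```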

Slide 9: Support Vector Machines I. Support vector machines (SVMs) solve the problem
f_{T,λ} = arg min_{f∈H} λ‖f‖²_H + (1/n) Σ_{i=1}^n L(y_i, f(x_i)),   (2)
where H is an RKHS, T = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n is a training set, λ > 0 is a free regularization parameter, and L : Y × R → [0, ∞) is a convex loss, e.g. the hinge loss L(y, t) := max{0, 1 − yt} or the least squares loss L(y, t) := (y − t)². Representer theorem: the unique solution is of the form f_{T,λ} = Σ_{i=1}^n α_i k(x_i, ·). Minimization therefore actually takes place over {α_1, ..., α_n}.
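
An added sketch, not the general algorithm of the talk: for the hinge loss, problem (2) needs a convex/QP solver, but for the least squares loss the representer coefficients α have a closed form, α = (K + nλI)^{-1} y. A minimal Python illustration with the Gaussian kernel:

```python
import numpy as np

def gauss_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-sigma^2 * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sigma**2 * sq)

def svm_least_squares(X, y, lam=0.1, sigma=1.0):
    """Coefficients alpha of f_{T,lambda} = sum_i alpha_i k(x_i, .) for the least
    squares loss: minimizing (2) over alpha gives alpha = (K + n*lambda*I)^{-1} y."""
    n = len(y)
    K = gauss_gram(X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(alpha, X_train, x_new, sigma=1.0):
    """Evaluate f_{T,lambda}(x_new) = sum_i alpha_i k(x_i, x_new)."""
    k_vec = np.exp(-sigma**2 * np.sum((X_train - x_new) ** 2, axis=1))
    return float(alpha @ k_vec)

X = np.random.randn(50, 2)
y = np.sign(X[:, 0] + X[:, 1])
alpha = svm_least_squares(X, y)
print(predict(alpha, X, np.array([0.5, 0.5])))
```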

Slide 10: Support Vector Machines II. Questions: Universally consistent? Learning rates? Efficient algorithms? Performance on real-world problems? Additional properties?

Slide 11: An Oracle Inequality: Assumptions. Assumptions and notation: L(y, 0) ≤ 1 for all y ∈ Y; L(y, ·) : R → [0, ∞) is convex and has a minimum in [−1, 1]. Clipping: t̂ := max{−1, min{1, t}}. L is locally Lipschitz: |L(y, t) − L(y, t')| ≤ |t − t'| for all y ∈ Y and t, t' ∈ [−1, 1]. This yields L(y, t̂) ≤ |L(y, t̂) − L(y, 0)| + L(y, 0) ≤ 2. Variance bound: there exist ϑ ∈ [0, 1] and V ≥ 2 such that for all f : X → R
E_P (L∘f̂ − L∘f*_{L,P})² ≤ V (E_P (L∘f̂ − L∘f*_{L,P}))^ϑ.

Slide 12: Entropy Numbers. Let S : E → F be a bounded linear operator and n ≥ 1. The n-th (dyadic) entropy number of S is defined by
e_n(S) := inf{ ε > 0 : ∃ x_1, ..., x_{2^{n−1}} ∈ F such that S B_E ⊂ ⋃_{i=1}^{2^{n−1}} (x_i + ε B_F) },
where B_E and B_F denote the closed unit balls of E and F.

Slide 13: An Oracle Inequality (slightly simplified). Let H be a separable RKHS of a measurable kernel with ‖k‖_∞ ≤ 1. Entropy assumption: there are p ∈ (0, 1) and a ≥ 1 with
E_{T_X∼P_X^n} e_i(id : H → L_2(T_X)) ≤ a i^{−1/(2p)}, i, n ≥ 1.
Fix an f_0 ∈ H and a B_0 ≥ 1 such that ‖L∘f_0‖_∞ ≤ B_0. Then there exists a constant K > 0 such that with probability P^n not less than 1 − e^{−τ} we have
R_{L,P}(f̂_{T,λ}) − R*_{L,P} ≤ 9 ( λ‖f_0‖²_H + R_{L,P}(f_0) − R*_{L,P} ) + K (a^{2p}/(λ^p n))^{1/(2−p−ϑ+ϑp)} + 3 (72Vτ/n)^{1/(2−ϑ)} + 30 B_0 τ/n.

Slide 14: A Simplification. Consider the approximation error function
A(λ) := min_{f∈H} ( λ‖f‖²_H + R_{L,P}(f) − R*_{L,P} )
and its (unique) minimizer f_{P,λ}. For f_0 := f_{P,λ} we can choose B_0 in terms of A(λ)/λ, which gives the refined oracle inequality
R_{L,P}(f̂_{T,λ}) − R*_{L,P} ≤ 9 A(λ) + K (a^{2p}/(λ^p n))^{1/(2−p−ϑ+ϑp)} + 3 (72Vτ/n)^{1/(2−ϑ)} + 30 (τ/n) √(A(λ)/λ) + 60 (τ/n) (A(λ)/λ).

Slide 15: Consistency. Assumptions: L(y, t) ≤ c(1 + |t|^q) for all y ∈ Y and t ∈ R, and H is dense in L_q(P_X). ⇒ A(λ) → 0 for λ → 0. ⇒ The SVM is consistent whenever we choose λ_n → 0 sufficiently slowly, so that the stochastic terms of the oracle inequality still vanish (e.g. nλ_n → ∞).

Slide 16: Learning Rates. Assumptions: there exist constants c ≥ 1 and β ∈ (0, 1] such that A(λ) ≤ cλ^β for all λ ≥ 0 (note: f*_{L,P} ∈ H gives β = 1), and L is Lipschitz continuous (e.g. the hinge loss). ⇒ Choosing λ_n ∼ n^{−α} we obtain a polynomial learning rate (Zhou et al., 2005?). Some calculations show that the best learning rate we can obtain is
n^{−min{ 2β/(β+1), β/(β(2−p−ϑ+ϑp)+p) }},
achieved by
λ_n ∼ n^{−min{ 2/(β+1), 1/(β(2−p−ϑ+ϑp)+p) }}.
But β, and often also ϑ and p, are unknown!
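
For concreteness (an added helper, not from the slides), the rate exponent above can be evaluated for example values of β, p, and ϑ; the specific values used below are arbitrary:

```python
def rate_exponent(beta, p, theta):
    """min{ 2*beta/(beta+1), beta/(beta*(2 - p - theta + theta*p) + p) } from the slide;
    the excess risk then decays like n**(-exponent)."""
    e1 = 2 * beta / (beta + 1)
    e2 = beta / (beta * (2 - p - theta + theta * p) + p)
    return min(e1, e2)

# e.g. beta = 1 (Bayes function in H), small entropy exponent p, best variance exponent theta = 1
print(rate_exponent(beta=1.0, p=0.1, theta=1.0))
```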

Slide 17: Adaptivity. For T = ((x_1, y_1), ..., (x_n, y_n)) define m := ⌊n/2⌋ + 1 and split T into
T_1 := ((x_1, y_1), ..., (x_m, y_m)) and T_2 := ((x_{m+1}, y_{m+1}), ..., (x_n, y_n)).
Fix an n^{−2}-net Λ_n of (0, 1]. Training: use T_1 to find f_{T_1,λ} for every λ ∈ Λ_n. Validation: use T_2 to determine a λ_{T_2} ∈ Λ_n that satisfies
R_{L,T_2}(f̂_{T_1,λ_{T_2}}) = min_{λ∈Λ_n} R_{L,T_2}(f̂_{T_1,λ}).
⇒ This yields a consistent learning method with learning rate n^{−min{ 2β/(β+1), β/(β(2−p−ϑ+ϑp)+p) }}.
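
A minimal sketch of this training/validation scheme (added, under the same least-squares simplification as the earlier sketch; the grid np.logspace(-4, 0, 9) is an arbitrary stand-in for the n^{−2}-net Λ_n):

```python
import numpy as np

def gram(A, B, sigma=1.0):
    """Gaussian Gram matrix K_ij = exp(-sigma^2 * ||a_i - b_j||^2)."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sigma**2 * sq)

def tv_svm(X, y, lambdas, sigma=1.0):
    """Split T into T1 (training) and T2 (validation), fit f_{T1,lambda} for every
    lambda in the grid (least squares loss, so alpha = (K + m*lambda*I)^{-1} y1),
    and return the lambda with the smallest empirical risk on T2."""
    n = len(y)
    m = n // 2 + 1
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]
    K11, K21 = gram(X1, X1, sigma), gram(X2, X1, sigma)
    best_risk, best_lam = np.inf, None
    for lam in lambdas:
        alpha = np.linalg.solve(K11 + m * lam * np.eye(m), y1)
        risk = np.mean((y2 - K21 @ alpha) ** 2)   # R_{L,T2} for the least squares loss
        if risk < best_risk:
            best_risk, best_lam = risk, lam
    return best_lam

X = np.random.randn(80, 2)
y = np.sign(X[:, 0])
print(tv_svm(X, y, lambdas=np.logspace(-4, 0, 9)))
```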

Slide 18: Discussion. The oracle inequality can be generalized to regularized risk minimizers. The presented oracle inequality yields the fastest known rates in many cases. In some cases these rates are known to be optimal in a minimax sense. Oracle inequalities can be used to design adaptive strategies that learn fast without knowing key parameters of P. Question: which distributions can be learned fast?

Slide 19: Discussion II. Observations: data often lies in high-dimensional spaces, but not uniformly. Regression: the target is often smooth (but not always). Classification: how much do the classes overlap?

Slide 20: Observations. The relation between the RKHS H and the distribution P is described by two quantities: the constants a and p in
E_{T_X∼P_X^n} e_i(id : H → L_2(T_X)) ≤ a i^{−1/(2p)}, i, n ≥ 1,
and the approximation error function
A(λ) := min_{f∈H} ( λ‖f‖²_H + R_{L,P}(f) − R*_{L,P} ).
Task: find realistic assumptions on P such that both quantities are small for commonly used kernels.

Slide 21: Entropy Numbers. Consider the integral operator T_k : L_2(P_X) → L_2(P_X) defined by
T_k f(x) := ∫_X k(x, x') f(x') dP_X(x').
Question: what is the relation between the eigenvalues of T_k and E_{T_X∼P_X^n} e_i(id : H → L_2(T_X))? Question: what is the behaviour if X ⊂ R^d but P_X is not absolutely continuous with respect to the Lebesgue measure?
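
A common numerical heuristic (added illustration, not a claim from the talk): the eigenvalues of the normalized Gram matrix K/n on a sample drawn from P_X approximate the leading eigenvalues of T_k. The sampling distribution below is an arbitrary choice:

```python
import numpy as np

def gauss_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-sigma^2 * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sigma**2 * sq)

n = 500
X = np.random.randn(n, 2)                      # sample from P_X (here: standard normal on R^2)
K = gauss_gram(X, sigma=1.0)
eigvals = np.sort(np.linalg.eigvalsh(K / n))[::-1]
print(eigvals[:10])                            # empirical approximation of the leading eigenvalues of T_k
```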

Slide 22: Approximation Error Function. For the least squares loss we have
A(λ) = inf_{f∈H} λ‖f‖²_H + ‖f − f*_{L,P}‖²_{L_2(P_X)}.
For Lipschitz continuous losses we have
A(λ) ≤ inf_{f∈H} λ‖f‖²_H + ‖f − f*_{L,P}‖_{L_1(P_X)}.
Smale & Zhou '03: in both cases the behaviour of A(λ) for λ → 0 can be characterized by the K-functional of the pair (H, L_p(P_X)). Question: what happens if X ⊂ R^d but P_X is not absolutely continuous with respect to the Lebesgue measure?

Slide 23: Observations: the Gaussian kernel is successfully used in many applications, and it has a parameter σ that is almost never fixed a priori. Question: how does σ influence the learning rates?

Slide 24: Approximation Quantities for Gaussian Kernels. For H_σ the RKHS of the Gaussian kernel k_σ, the entropy assumption takes the form
E_{T_X∼P_X^n} e_i(id : H_σ → L_2(T_X)) ≤ a_σ i^{−1/(2p)}, i, n ≥ 1.
The approximation error function also depends on σ:
A_σ(λ) = inf_{f∈H_σ} λ‖f‖²_{H_σ} + R_{L,P}(f) − R*_{L,P}.

Slide 25: Oracle Inequality using Gaussian kernels:
R_{L,P}(f̂_{T,λ}) − R*_{L,P} ≤ 9 ( λ‖f_0‖²_{H_σ} + R_{L,P}(f_0) − R*_{L,P} ) + K (a_σ^{2p}/(λ^p n))^{1/(2−p−ϑ+ϑp)} + 3 (72Vτ/n)^{1/(2−ϑ)} + 30 B_0(σ) τ/n.
Usually σ becomes larger with the sample size. Task: find estimates that are good in σ and i (or λ) simultaneously.

Slide 26: An Estimate for the Entropy Numbers. Theorem (S. & Scovel, AoS 2007): let X ⊂ R^d be compact. Then for all ε > 0 and 0 < p < 1 there exists a constant c_{ε,p} ≥ 1 such that
E_{T_X∼P_X^n} e_i(id : H_σ → L_2(T_X)) ≤ c_{ε,p} σ^{(1−p)(1+ε)d/(2p)} i^{−1/(2p)}.
This estimate does not take properties of P_X into account. Questions: how good is this estimate? For which P_X can it be significantly improved?

Slide 27: Distance to the Decision Boundary. Let η(x) := P(y = 1 | x), X_{−1} := {η < 1/2} and X_1 := {η > 1/2}. For x ∈ X ⊂ R^d define
Δ(x) := d(x, X_1) if x ∈ X_{−1}, Δ(x) := d(x, X_{−1}) if x ∈ X_1, and Δ(x) := 0 otherwise,   (3)
where d(x, A) denotes the distance between x and the set A. Interpretation: Δ(x) measures the distance of x to the decision boundary.
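
A small added illustration with a hypothetical circular boundary: for the classes X_1 = {‖x‖ < r} and X_{−1} = {‖x‖ > r} in R², the distance-to-the-other-class function (3) has the closed form Δ(x) = | ‖x‖ − r |:

```python
import numpy as np

def delta_circle(x, r=1.0):
    """Delta(x) from (3) for X_1 = {||x|| < r} and X_{-1} = {||x|| > r}: | ||x|| - r |."""
    return abs(np.linalg.norm(x) - r)

for x in [np.array([0.2, 0.0]), np.array([1.5, 0.5])]:
    print(x, delta_circle(x))
```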

Slide 28: Margin Exponents. Margin exponent: there are c ≥ 1 and α > 0 such that
P_X( Δ(x) ≤ t ) ≤ c t^α, t > 0.
Example: X ⊂ R^d compact with positive volume, P_X the uniform distribution, and the decision boundary a hyperplane or a circle ⇒ α = 1; this remains true under transformations. Margin-noise exponent: there are c ≥ 1 and β > 0 such that
∫_{{Δ(x) ≤ t}} |2η(x) − 1| dP_X(x) ≤ c t^β, t > 0.
Interpretation: the exponent β measures how much of |2η − 1| dP_X is concentrated near the decision boundary.
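
An added sanity check (illustrative choices only): for P_X uniform on [0, 1]² with the linear decision boundary {x_1 = 1/2}, one has P_X(Δ(x) ≤ t) = min{2t, 1}, so α = 1. A quick Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(100_000, 2))      # P_X uniform on [0, 1]^2
delta = np.abs(X[:, 0] - 0.5)           # distance to the other class across {x_1 = 1/2}

for t in [0.05, 0.1, 0.2]:
    print(t, np.mean(delta <= t))        # approximately 2 * t, i.e. alpha = 1
```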

Slide 29: A Bound on the Approximation Error. Theorem (S. & Scovel, AoS 2007): let X ⊂ R^d be compact, P a distribution on X × {−1, 1} with margin-noise exponent β, and L the hinge loss. Then there exist constants c_{d,τ} > 0 and c_{d,β} > 0 such that for all σ > 0 and λ > 0 there is an f ∈ H_σ satisfying ‖f‖_∞ ≤ 1 and
λ‖f‖²_{H_σ} + R_{L,P}(f) − R*_{L,P} ≤ c_{d,τ} λσ^d + c_{d,β} c σ^{−β}.
Remarks: the bound is not optimal in λ. How can this be improved? Better dependence on the dimension?!

Slide 30: Conclusion. Oracle inequalities can be used to design adaptive SVMs. For which distributions do such adaptive SVMs learn fast? This leads to two tasks: bounds for entropy numbers and bounds for the approximation error.
