Robust Support Vector Machines for Probability Distributions


1 Robust Support Vector Machines for Probability Distributions

Andreas Christmann
joint work with Ingo Steinwart (Los Alamos National Lab)
ICORS 2008, Antalya, Turkey, September 8-12, 2008

2 Applications

1 web mining: classification of WWW sites
2 text mining: classification of text files; levels: sub-word, word, multi-word, semantic
3 classification of images: detection of abnormal structures (medicine), detection of handwritten digits
4 statistical model choice: goodness-of-fit evaluated for distributions/models

3 Applications

In common: classification or regression.
input space $X$, output space $Y \subseteq \mathbb{R}$
$X := \mathcal{M}_1(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$, the set of all distributions on $\mathcal{X} \subseteq \mathbb{R}^d$
data: $D = ((x_1, y_1), \ldots, (x_n, y_n)) \in (X \times Y)^n$
assume: i.i.d. random variables $(X_i, Y_i)$ with observations $(x_i, y_i)$
distribution $P$ of $(X_i, Y_i)$ totally unknown
$n$ often very large

4 Applications

Goals:
1 good automatic prediction performance
2 efficient algorithms for training and prediction
3 consistency and robustness

Empirically known: Support Vector Machines (SVMs) satisfy 1+2.
Hein, Lal & Bousquet (2004)
Hein & Bousquet (2005)
Hein, Bousquet & Schölkopf (2005)
Smola, Gretton, Song & Schölkopf (2007)
Fukumizu, Bach & Jordan (2008), ...

Topic of this talk: consistency and robustness.

5 Support Vector Machines

input space: $X$ Polish space
output space: $Y \subseteq \mathbb{R}$ closed
data: $D = ((x_1, y_1), \ldots, (x_n, y_n)) \in (X \times Y)^n$
random variables: $(X_i, Y_i)$ i.i.d.
distribution: $P$ of $(X_i, Y_i)$ totally unknown

If $X$ and $Y$ are Polish spaces, then $P$ can be split into $P(y|x)$ and $P_X$ (Dudley, 2002, Thm.).
If $(\mathcal{X}, d_{\mathcal{X}})$ is a Polish space, then $X = \mathcal{M}_1(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ is a Polish space (Billingsley, 1999, Thm. 6.8).

6 Support Vector Machines

loss: $L : Y \times \mathbb{R} \to [0, \infty)$ convex, measurable
risk: $R_{L,P}(f) := \mathbb{E}_P\, L(Y, f(X))$
SVM: $S(P) := f_{P,\lambda} := \arg\min_{f \in H} R_{L,P}(f) + \lambda \|f\|_H^2$,
where $P \in \mathcal{M}_1$, $H$ a reproducing kernel Hilbert space, and $\lambda > 0$.

Vapnik & Lerner (1963); Boser, Guyon & Vapnik (1992)
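
For concreteness, a minimal numerical sketch (my own, not from the slides): replacing $P$ by the empirical distribution of $D$, the representer theorem gives $f = \sum_j \alpha_j k(\cdot, x_j)$, so the regularized risk becomes a finite-dimensional function of $\alpha$, written here for the hinge loss:

    import numpy as np

    def regularized_risk(alpha, K, y, lam):
        # f(x_i) = sum_j alpha_j k(x_i, x_j) = (K @ alpha)_i   (representer theorem)
        # ||f||_H^2 = alpha' K alpha
        f = K @ alpha
        hinge = np.maximum(0.0, 1.0 - y * f)   # L(y, t) = max(0, 1 - y t)
        return hinge.mean() + lam * alpha @ K @ alpha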

7 Loss Functions for Classification

[Plots of four classification losses: hinge, logistic, truncated Huber, and AdaBoost.]

8 Loss Functions for Regression

[Plots of four regression losses L(r) against the residual r: ε-insensitive (ε = 0.5), logistic, Huber, and pinball (τ = 0.10).]

9 Kernels

kernel: $k : X \times X \to \mathbb{R}$ is a kernel if there exist an $\mathbb{R}$-Hilbert space $H$ and a map $\Phi : X \to H$ such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle_H$ for all $x, x' \in X$
canonical feature map: $\Phi : X \to H$, $\Phi(x) = k(\cdot, x)$
example (GRBF): $k(x, x') = e^{-\|x - x'\|_2^2 / \gamma^2}$, $\gamma > 0$
reproducing kernel Hilbert space (RKHS) $H$: for every $x \in X$, the evaluation functional $\delta_x(f) := f(x)$, $f \in H$, is continuous
reproducing kernel: for all $f \in H$ and $x \in X$: $k(\cdot, x) \in H$ and $f(x) = \langle f, k(\cdot, x) \rangle_H$
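
A small sketch (assumed example, not from the slides) checking the GRBF kernel numerically: a kernel's Gram matrix must be symmetric and positive semi-definite.

    import numpy as np

    def grbf(X1, X2, gamma=1.0):
        # k(x, x') = exp(-||x - x'||_2^2 / gamma^2)
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / gamma**2)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    K = grbf(X, X)
    assert np.allclose(K, K.T)                   # symmetric
    assert np.linalg.eigvalsh(K).min() > -1e-10  # positive semi-definite (up to rounding)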

10 Classical SVM based on Hinge Loss

$f_{D,\lambda}(\cdot) = \sum_{i=1}^n \alpha_i y_i k(\cdot, x_i)$,

where $\alpha = (\alpha_1, \ldots, \alpha_n)$ solves

$\max_{\alpha \in [0, C]^n}\; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j)$

and $C := \frac{1}{2\lambda n}$.
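
A minimal sketch (my construction, not from the slides) of solving this box-constrained dual by projected gradient ascent; the step size and iteration count are illustrative choices:

    import numpy as np

    def grbf(X1, X2, gamma=1.0):
        # GRBF kernel from the previous slide
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / gamma**2)

    def hinge_svm_dual(K, y, lam, n_iter=2000, step=1e-2):
        # maximize sum(alpha) - 0.5 * alpha' Q alpha over alpha in [0, C]^n,
        # with Q_ij = y_i y_j K_ij and C = 1 / (2 * lam * n)
        n = len(y)
        C = 1.0 / (2.0 * lam * n)
        Q = np.outer(y, y) * K
        alpha = np.zeros(n)
        for _ in range(n_iter):
            alpha = np.clip(alpha + step * (1.0 - Q @ alpha), 0.0, C)
        return alpha

    # decision function: f(x) = sum_i alpha_i y_i k(x, x_i)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 2))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=80))
    alpha = hinge_svm_dual(grbf(X, X), y, lam=0.01)
    f = grbf(X, X) @ (alpha * y)
    print("training accuracy:", (np.sign(f) == y).mean())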

11 Kernels for Distributions

Examples:
Hellinger: $k(P_1, P_2) := \int_{\mathcal{X}} \sqrt{p_1(x)\, p_2(x)} \, d\mu(x)$
Total variation: $k(P_1, P_2) := \int_{\mathcal{X}} \min\{p_1(x), p_2(x)\} \, d\mu(x)$

Properties:
positive definite, symmetric
based on Hilbertian metrics; Hein & Bousquet (2004)
but $H$ can be too small
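
For discrete distributions (probability vectors), these two kernels reduce to simple sums; a short sketch (assumed example):

    import numpy as np

    def hellinger_kernel(p1, p2):
        # k(P1, P2) = sum_x sqrt(p1(x) * p2(x))  (Hellinger affinity)
        return np.sum(np.sqrt(p1 * p2))

    def tv_kernel(p1, p2):
        # k(P1, P2) = sum_x min(p1(x), p2(x))
        return np.sum(np.minimum(p1, p2))

    p1 = np.array([0.2, 0.5, 0.3])
    p2 = np.array([0.3, 0.3, 0.4])
    print(hellinger_kernel(p1, p2), tv_kernel(p1, p2))  # both equal 1 iff p1 == p2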

12 Kernels for Distributions

Examples: Assume $\mathcal{X} \subseteq \mathbb{R}^d$ bounded and $X := \mathcal{M}_1(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$. Let $\gamma > 0$.

$k(P_1, P_2) := \exp\!\big( -\|P_1 - P_2\|_{L_2(\lambda^d)}^2 \,/\, \gamma^2 \big)$
$k(P_1, P_2) := \exp\!\big( -\|\mathbb{E}_{P_1} \tilde\Phi(X) - \mathbb{E}_{P_2} \tilde\Phi(X)\|_{\tilde H}^2 \,/\, \gamma^2 \big)$,

where $\tilde k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a continuous, bounded, and positive definite kernel with RKHS $\tilde H$ and canonical feature map $\tilde\Phi$.

Properties:
continuous, bounded, and positive definite symmetric
$k$ and $\Phi$ are measurable
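
The second kernel can be estimated from data: the squared RKHS distance between the two mean embeddings is the squared maximum mean discrepancy, computable from Gram matrices. A sketch under the assumption that each distribution is represented by an i.i.d. sample (the biased V-statistic estimate is used for brevity):

    import numpy as np

    def grbf(X1, X2, gamma=1.0):
        # underlying kernel on the base space (GRBF)
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / gamma**2)

    def mmd2(X1, X2, gamma=1.0):
        # || E_P1 Phi(X) - E_P2 Phi(X) ||_H^2, estimated from samples X1 ~ P1, X2 ~ P2
        return (grbf(X1, X1, gamma).mean() + grbf(X2, X2, gamma).mean()
                - 2.0 * grbf(X1, X2, gamma).mean())

    def distribution_kernel(X1, X2, gamma_inner=1.0, gamma_outer=1.0):
        # Gaussian-type kernel between the two distributions
        return np.exp(-mmd2(X1, X2, gamma_inner) / gamma_outer**2)

    rng = np.random.default_rng(0)
    P1 = rng.normal(0.0, 1.0, size=(200, 1))
    P2 = rng.normal(0.5, 1.0, size=(200, 1))
    print(distribution_kernel(P1, P1), distribution_kernel(P1, P2))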

13 Questions

Which properties must $L$, $k$, and $S : P \mapsto f_{P,\lambda}$ have such that
$R_{L,P}(f_{D,\lambda_n}) \xrightarrow{P}$ the smallest possible risk $\inf_f R_{L,P}(f)$?
$f_{P,\lambda}$ is robust?

14 Results: Consistency

Assume: $X$ Polish space, $Y \subseteq \mathbb{R}$ closed, both equipped with their Borel $\sigma$-algebras.

THM (Steinwart & CHR 08). Assume
$L(y, t) = \psi(y - t)$ convex, continuous, of growth type $p \in [1, \infty)$
$k$ measurable, bounded
$H$ separable RKHS, dense in $L_p(P_X)$
Let $p^* := \max\{2p, p^2\}$ and $(\lambda_n)$ with $\lambda_n \to 0$ and $n \lambda_n^{p^*} \to \infty$. Then

$R_{L,P}(f_{D,\lambda_n}) \xrightarrow{P} \inf_{f : X \to \mathbb{R} \text{ measurable}} R_{L,P}(f), \qquad n \to \infty,$

for data sets $D$ with $|D| = n$ and for all $P \in \mathcal{M}_1(X \times Y)$ with $\mathbb{E}_P |Y|^p < \infty$.
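
A rough simulation sketch of the theorem (my construction, not from the slides): it uses scikit-learn's SVR, whose parameter $C$ corresponds to $1/(2\lambda n)$ in the formulation above, and the ε-insensitive loss, which is Lipschitz and hence of growth type $p = 1$, so $p^* = 2$ and $\lambda_n = n^{-1/4}$ satisfies both conditions. The constants are illustrative; the risk decrease is an asymptotic statement.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(1)

    def sample(n):
        # toy regression model: Y = sin(X) + Gaussian noise
        X = rng.uniform(-3, 3, size=(n, 1))
        y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=n)
        return X, y

    Xte, yte = sample(20_000)  # large test set as a proxy for P
    eps = 0.1
    for n in [100, 400, 1600]:
        lam = n ** -0.25  # lam_n -> 0 and n * lam_n^2 -> infinity
        X, y = sample(n)
        m = SVR(kernel="rbf", gamma=1.0, C=1.0 / (2.0 * lam * n), epsilon=eps).fit(X, y)
        # empirical eps-insensitive risk on the test set
        risk = np.maximum(0.0, np.abs(yte - m.predict(Xte)) - eps).mean()
        print(n, round(risk, 4))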

15 Results: Influence Function

Case A: $L$ smooth.

THM (CHR & Steinwart 08). Assume
$H$ RKHS of a bounded continuous kernel $k$ on $X$ with canonical feature map $\Phi$
$L : Y \times \mathbb{R} \to [0, \infty)$ convex, and $L$, $F_2 L$, $F_{2,2}L$ are $P$-integrable Nemitski loss functions ($F_2$: partial Fréchet derivative with respect to the second argument).
Then:

$\mathrm{IF}((x, y); S, P) = \mathbb{E}_P\, F_2 L(Y, f_{P,\lambda}(X))\, M^{-1}\Phi(X) \;-\; F_2 L(y, f_{P,\lambda}(x))\, M^{-1}\Phi(x),$

where $M = 2\lambda\, \mathrm{id}_H + \mathbb{E}_P\, F_{2,2}L(Y, f_{P,\lambda}(X))\, \langle \Phi(X), \cdot \rangle_H\, \Phi(X)$.

16 Case B: $L$ may have corners

DEF (CHR & Van Messem 08). Let $H$ be a Hilbert space. The Bouligand influence function ($\mathrm{IF}_B$) of a function $S : P \mapsto S(P) \in H$ at a distribution $P$ in the direction of a distribution $Q \neq P$ is the special Bouligand derivative $\mathrm{IF}_B(Q; S, P)$ satisfying

$\lim_{\varepsilon \downarrow 0} \frac{\big\| S\big((1-\varepsilon)P + \varepsilon Q\big) - S(P) - \varepsilon\, \mathrm{IF}_B(Q; S, P) \big\|_H}{\varepsilon} = 0.$

Special case $Q = \delta_z$: if $\mathrm{IF}_B$ exists, then $\mathrm{IF}$ exists and $\mathrm{IF}_B = \mathrm{IF}$.
Goal: bounded $\mathrm{IF}_B$.

17 Results: Bouligand Influence Function

THM (CHR & Van Messem 08). Consider the regression model and assume:
$X$ separable Banach space [in the paper: $X \subseteq \mathbb{R}^d$]
$L$ convex, Lipschitz continuous, with bounded Bouligand derivatives $B_2 L$ and $B_{2,2}L$
$k$ measurable, bounded, and ... (see paper).
Then $\mathrm{IF}_B(Q; S, P)$ with $S(P) := f_{P,\lambda}$ is bounded, and

$\mathrm{IF}_B(Q; S, P) = M^{-1}\big( \mathbb{E}_P\, B_2 L(Y, f_{P,\lambda}(X))\, \Phi(X) \big) \;-\; M^{-1}\big( \mathbb{E}_Q\, B_2 L(Y, f_{P,\lambda}(X))\, \Phi(X) \big),$

where $M = 2\lambda\, \mathrm{id}_H + \mathbb{E}_P\, B_{2,2}L(Y, f_{P,\lambda}(X))\, \langle \Phi(X), \cdot \rangle_H\, \Phi(X)$.
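
The boundedness claim can be probed numerically with the difference quotient from the definition on the previous slide. A rough sketch (my construction, not from the slides): it emulates $(1-\varepsilon)P_n + \varepsilon\delta_z$ via sample weights and uses scikit-learn's SVR, i.e. the Lipschitz ε-insensitive loss, so the difference quotient should stay moderate even for a gross outlier $z$.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

    def fit(Xtr, ytr, w=None):
        # eps-insensitive loss: convex and Lipschitz, as required by the theorem
        return SVR(kernel="rbf", gamma=1.0, C=1.0, epsilon=0.1).fit(Xtr, ytr, sample_weight=w)

    eps = 0.05
    zx, zy = 0.0, 50.0  # gross outlier z = (zx, zy)
    Xc = np.vstack([X, [[zx]]])
    yc = np.append(y, zy)
    w = np.append(np.full(len(X), (1.0 - eps) / len(X)), eps)  # (1-eps) P_n + eps delta_z

    grid = np.linspace(-3, 3, 61).reshape(-1, 1)
    diff_quot = (fit(Xc, yc, w).predict(grid) - fit(X, y).predict(grid)) / eps
    print("max |difference quotient| on grid:", np.abs(diff_quot).max())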

18 Summary

Support Vector Machines
1 can even be used if the input values are distributions,
2 are able to learn ($L$-risk consistent),
3 are robust (if $k$ is bounded and $L$ is Lipschitz continuous).

19 References

Steinwart & Christmann (2008). Support Vector Machines. Springer, New York.
Christmann & Van Messem (2008). J. Mach. Learn. Res., 9.
Christmann & Steinwart (2007). Bernoulli, 13(3), 799-819.
Fukumizu, Bach & Jordan (2008). Ann. Statist. (to appear).
Hein & Bousquet (2004). In: Proceedings of AISTATS 2005.
Hein, Bousquet & Schölkopf (2005). J. Computer System Sciences, 71.
Hein, Lal & Bousquet (2004). In: Proc. 26th DAGM Symposium, Springer.
Smola, Gretton, Song & Schölkopf (2007). In: Algorithmic Learning Theory.
