On non-parametric robust quantile regression by support vector machines

Andreas Christmann
joint work with: Ingo Steinwart (Los Alamos National Lab), Arnout Van Messem (Vrije Universiteit Brussel)

ERCIM 2008, Neuchâtel, Switzerland, June 19-21, 2008
Example: linear quantile regression

[Figure: logratio vs. range, LIDAR data set. Quantiles $\alpha = 0.95, 0.50, 0.05$.]
Example: SVM for quantile regression

[Figure: logratio vs. range, LIDAR data set. Quantiles $\alpha = 0.95, 0.50, 0.05$.]
Nonparametric Quantile Regression

Assumptions:
- $X$ a complete measurable space, e.g. $X = \mathbb{R}^d$; $Y \subseteq \mathbb{R}$ closed, $Y \neq \emptyset$
- $D = D_n = (z_1, \ldots, z_n)$, $z_i := (x_i, y_i) \in Z := X \times Y$, with empirical measure $\mathrm{D} := \frac{1}{n} \sum_{i=1}^n \delta_{(x_i, y_i)}$
- $(X_i, Y_i)$ i.i.d. $\sim P \in M_1$, $P$ (totally) unknown

Goal: estimate the quantile function
$$f_{\alpha,P}(x) = \inf\{ q \in Y : P(Y \le q \mid X = x) \ge \alpha \}, \qquad x \in X$$

Assumption: $f_{\alpha,P}$ is unique.

If $f_{\alpha,P}$ is linear: Koenker & Bassett 78, Koenker 05
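As a concrete instance of the definition, the empirical $\alpha$-quantile of a sample is the smallest value $q$ with $F_n(q) \ge \alpha$. A minimal sketch (the function name is my own, not from the talk):

```python
import numpy as np

def empirical_quantile(y, alpha):
    """inf{ q : F_n(q) >= alpha } for the empirical distribution F_n of y."""
    ys = np.sort(np.asarray(y, dtype=float))
    n = ys.size
    # smallest 1-based index k with k/n >= alpha
    k = max(int(np.ceil(alpha * n)), 1)
    return ys[k - 1]

print(empirical_quantile([3.0, 1.0, 2.0, 4.0], 0.5))   # 2.0
```

This is the plug-in estimate for a single cell; the talk's point is estimating the conditional version $f_{\alpha,P}(x)$ without binning $X$.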
Loss Function and Risk

Pinball loss function:
$$L_\alpha(y, t) := \begin{cases} (\alpha - 1)(y - t), & y - t < 0 \\ \alpha (y - t), & y - t \ge 0 \end{cases}$$

[Figure: pinball loss as a function of $y - t$ for $\alpha = 0.1$ and $\alpha = 0.75$.]

Risk:
$$R_{L_\alpha,P}(f) := E_P\, L_\alpha\big(Y, f(X)\big), \qquad P \in M_1$$
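The pinball loss is straightforward to evaluate; a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def pinball_loss(y, t, alpha):
    """Pinball loss: alpha*(y-t) if y-t >= 0, (alpha-1)*(y-t) otherwise."""
    r = np.asarray(y, dtype=float) - np.asarray(t, dtype=float)
    return np.where(r >= 0, alpha * r, (alpha - 1.0) * r)

print(pinball_loss(2.0, 0.0, 0.75))  # 1.5
```

The asymmetry is what targets the quantile: for $\alpha = 0.75$, residuals above the fit cost three times as much as residuals below it, which pushes the risk minimizer up to the 0.75-quantile.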
Support Vector Machine approach

Schölkopf et al. 00 and Takeuchi et al. 06 proposed
$$f_{D,\lambda} = \arg\min_{f \in H} \frac{1}{n} \sum_{i=1}^n L_\alpha\big(y_i, f(x_i)\big) + \lambda \|f\|_H^2,$$
where $H$ is a reproducing kernel Hilbert space (RKHS) and $\lambda > 0$.

Population version:
$$S(P) = f_{P,\lambda} = \arg\min_{f \in H} E_P\, L_\alpha\big(Y, f(X)\big) + \lambda \|f\|_H^2, \qquad P \in M_1$$
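By the representer theorem, the empirical minimizer has the form $f = \sum_j c_j k(\cdot, x_j)$, so the problem is finite-dimensional in $c$ and can be attacked with subgradient descent. A rough self-contained sketch, not the solver used in the talk (function name, step size, and iteration count are my own choices):

```python
import numpy as np

def fit_svm_quantile(X, y, alpha, lam, gamma=1.0, lr=0.01, n_iter=2000):
    """Minimize (1/n) sum_i L_alpha(y_i, f(x_i)) + lam * ||f||_H^2 over
    f = sum_j c_j k(., x_j) (representer theorem) by subgradient descent."""
    n = y.size
    # Gaussian RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq)
    c = np.zeros(n)
    for _ in range(n_iter):
        r = y - K @ c
        # subgradient of the pinball loss with respect to f(x_i)
        g = np.where(r >= 0, -alpha, 1.0 - alpha)
        # gradient in c: K g / n + 2 lam K c   (since ||f||_H^2 = c^T K c)
        c -= lr * (K @ g / n + 2.0 * lam * (K @ c))
    return c, K
```

With $\alpha = 0.5$ this fits a kernel median; the regularized empirical risk of the fitted coefficients should drop well below that of $f \equiv 0$.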
Kernel

$k : X \times X \to K$ is a kernel if there exist a $K$-Hilbert space $H$ and a map $\Phi : X \to H$ such that
$$k(x, x') = \langle \Phi(x'), \Phi(x) \rangle_H, \qquad x, x' \in X$$

$H$ is a reproducing kernel Hilbert space (RKHS) if, for every $x \in X$, the evaluation functional $\delta_x(f) := f(x)$, $f \in H$, is continuous.

Reproducing kernel: $k(\cdot, x) \in H$ for all $x \in X$, and $f(x) = \langle f, k(\cdot, x) \rangle_H$ for all $f \in H$, $x \in X$.

Canonical feature map: $\Phi(x) = k(\cdot, x)$, $x \in X$.

Bounded kernel: $\|k\|_\infty := \sup_{x \in X} \sqrt{k(x, x)} < \infty$.

GRBF kernel: $k(x, x') = e^{-\gamma \|x - x'\|_2^2}$, $\gamma > 0$.
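The GRBF kernel is bounded with $\|k\|_\infty = 1$. A minimal sketch of its Gram matrix, using the expansion $\|x - x'\|^2 = \|x\|^2 + \|x'\|^2 - 2\langle x, x'\rangle$ (function name is my own):

```python
import numpy as np

def grbf_gram(X1, X2, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||X1[i] - X2[j]||_2^2)."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    # clip tiny negative values caused by floating-point cancellation
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

The diagonal of grbf_gram(X, X, gamma) is identically 1, reflecting $k(x, x) = 1$, and the matrix is symmetric positive semi-definite.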
Consistency

$f_{D,\lambda_n}$ is risk consistent if
$$R_{L_\alpha,P}(f_{D,\lambda_n}) \to_P R^*_{L_\alpha,P} := \inf_{f : X \to \mathbb{R} \text{ measurable}} R_{L_\alpha,P}(f) \quad (2.1)$$

$$R^*_{L_\alpha,P,H} := \inf_{f \in H} R_{L_\alpha,P}(f) = R^*_{L_\alpha,P} \quad (2.2)$$

Large RKHS (CHR & Steinwart 08). Let $H$ be the RKHS of a bounded kernel $k : X \times X \to \mathbb{R}$ and let $\mu$ be a distribution on $X$. Then the following statements are equivalent:
1. $H$ is dense in $L_1(\mu)$.
2. (2.2) holds for all $P \in M_1$ with $P_X = \mu$ and $E_P |Y| < \infty$.
Bounded $L$-risk (CHR & Steinwart 08)

Assume: $P \in M_1$ with $E_P |Y| < \infty$, and $f : X \to \mathbb{R}$ with $f \in L_1(P)$.
Then: $R_{L_\alpha,P}(f) < \infty$.
Existence and uniqueness (CHR & Steinwart 08)

Assume: $P \in M_1$ with $E_P |Y| < \infty$, $H$ the RKHS of a bounded kernel $k$, $\lambda > 0$.
Then:
1. There exists a unique minimizer $S(P) = f_{P,\lambda} \in H$.
2. $\|f_{P,\lambda}\|_H \le \sqrt{R_{L_\alpha,P}(0)/\lambda}$.
Consistency (Steinwart & CHR 08a)

Assume: $H$ a separable RKHS of a bounded measurable kernel $k$ such that $H$ is dense in $L_1(\mu)$ for all distributions $\mu$ on $X$; $(\lambda_n)_{n \in \mathbb{N}}$ with $\lambda_n \to 0$.

If $\lambda_n^2 n \to \infty$, then for all $P \in M_1$ with $E_P |Y| < \infty$:
1. $R_{L_\alpha,P}(f_{D,\lambda_n}) \to_P R^*_{L_\alpha,P}$
2. $\|f_{D,\lambda_n} - f_{\alpha,P}\|_{L_0(P_X)} \to_P 0$

If $\delta > 0$ and $\lambda_n^{2+\delta} n \to \infty$, then for all $P \in M_1$ with $E_P |Y| < \infty$:
3. $R_{L_\alpha,P}(f_{D,\lambda_n}) \to R^*_{L_\alpha,P}$ a.s.
4. $\|f_{D,\lambda_n} - f_{\alpha,P}\|_{L_0(P_X)} \to 0$ a.s.
Rate of convergence

1. No-free-lunch theorem: there is no uniform rate of convergence! [Devroye 82]
2. Under additional assumptions [Steinwart & CHR 08b]:
$$\|f_{D,\lambda_n} - f_{\alpha,P}\|_{L_\infty(P_X)} \le c\, n^{-1/3}$$
Robustness

What is the impact on $S(P)$ or $S(P_n)$ of violations of the assumption $(X_i, Y_i)$ i.i.d. $\sim P$, $P \in M_1$ unknown?
Bias, maxbias, sensitivity curve (CHR & Steinwart 07)

Assume: $E_P |Y| < \infty$ and $E_{\tilde P} |Y| < \infty$ (can be weakened), $H$ the RKHS of a continuous and bounded kernel $k$, $\lambda > 0$, $\varepsilon > 0$.

Then, with $c := \lambda^{-1} \|k\|_\infty \max\{\alpha, 1 - \alpha\}$:
1. Bias: $\|f_{(1-\varepsilon)P + \varepsilon \tilde P, \lambda} - f_{P,\lambda}\|_H \le c\, \|\tilde P - P\|_{tv}\, \varepsilon$
2. Maxbias: $\sup_{Q \in N_\varepsilon(P)} \|f_{Q,\lambda} - f_{P,\lambda}\|_H \le 2c\, \varepsilon$
3. Sensitivity curve: $\|SC_n(z; S_n)\|_H \le 2c$ for all $z \in X \times Y$
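The sensitivity curve compares the estimate with and without one added observation $z$. A minimal numerical sketch of Tukey's definition $SC_n(z) = n\,(T(z_1,\ldots,z_{n-1},z) - T(z_1,\ldots,z_{n-1}))$, illustrated with the empirical median rather than the SVM estimator (function name is my own):

```python
import numpy as np

def sensitivity_curve(estimator, sample, z):
    """SC_n(z) = n * (T(sample + [z]) - T(sample))."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size + 1
    return n * (estimator(np.append(sample, z)) - estimator(sample))

# The median's sensitivity curve is bounded: pushing the added
# point z further out does not change SC_n at all.
print(sensitivity_curve(np.median, [1, 2, 3, 4, 5], 100.0))  # 3.0
print(sensitivity_curve(np.median, [1, 2, 3, 4, 5], 1e6))    # 3.0
```

The same boundedness in $\|\cdot\|_H$ is what item 3 above guarantees for the SVM quantile estimator with a bounded kernel.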
Goal: Bounded BIF

Bouligand Influence Function

$g : X \to Z$ is Bouligand differentiable at $x_0 \in X$ if there exists a positive homogeneous function $\nabla^B g(x_0) : X \to Z$ with
$$\lim_{h \to 0} \frac{\| g(x_0 + h) - g(x_0) - \nabla^B g(x_0)(h) \|_Z}{\| h \|_X} = 0.$$

Def. (CHR & Van Messem 08). The Bouligand influence function (BIF) of a function $S : M_1 \to H$ at a distribution $P$ in the direction of a distribution $Q \ne P$ is the special B-derivative (if it exists) satisfying
$$\lim_{\varepsilon \downarrow 0} \frac{\| S\big((1-\varepsilon)P + \varepsilon Q\big) - S(P) - \varepsilon\, \mathrm{BIF}(Q; S, P) \|_H}{\varepsilon} = 0.$$
Bounded BIF (CHR & Van Messem 08)

Assume:
- $k$ bounded and measurable
- $E_P |Y| < \infty$, $E_Q |Y| < \infty$ (can be weakened)
- there exist $\delta > 0$ and positive constants $\xi_P$, $\xi_Q$, $c_P$, $c_Q$ such that for all $t \in \mathbb{R}$ with $|t - f_{P,\lambda}(x)| \le \delta \|k\|_\infty$, all $a \in [0, 2\delta \|k\|_\infty]$, and all $x \in X$:
$$P\big(Y \in [t, t+a] \mid x\big) \le c_P\, a^{1+\xi_P}, \qquad Q\big(Y \in [t, t+a] \mid x\big) \le c_Q\, a^{1+\xi_Q}.$$

Then $\mathrm{BIF}(Q; S, P)$ with $S(P) := f_{P,\lambda}$:
1. exists,
2. equals
$$\frac{1}{2\lambda} \int_X \Big( P\big(Y \le f_{P,\lambda}(x) \mid x\big) - \alpha \Big) \Phi(x)\, dP_X(x) - \frac{1}{2\lambda} \int_X \Big( Q\big(Y \le f_{P,\lambda}(x) \mid x\big) - \alpha \Big) \Phi(x)\, dQ_X(x),$$
3. is bounded.
Conclusions

Non-parametric quantile regression by SVMs:
1. has a unique solution,
2. is consistent: the $L_\alpha$-risk of $f_{D,\lambda_n}$ converges to the Bayes risk (in probability); $f_{D,\lambda_n}$ converges to the true quantile function (in probability or a.s.); a rate of convergence holds under additional assumptions,
3. is robust if the kernel is bounded: the Bouligand influence function, the sensitivity curve, and the maxbias are bounded,
4. is computable for large, high-dimensional data sets.
References

- CHR & Steinwart (2007). Bernoulli.
- CHR & Steinwart (2008). Appl. Stochastic Models Bus. Ind.
- CHR & Van Messem (2008). J. Mach. Learn. Res.
- Koenker (2005). Quantile Regression. Cambridge U.P.
- Koenker & Bassett (1978). Econometrica.
- Schölkopf & Smola (2002). Learning with Kernels. MIT Press.
- Steinwart & CHR (2008). Support Vector Machines. Springer.
- Steinwart & CHR (2008b). Advances in Neural Information Processing Systems, 20, 305-312.
- Takeuchi, Le, Sears, Smola (2006). J. Mach. Learn. Res.
Appendix: Example nonparametric QRSS

[Figure: logratio vs. range, LIDAR data set. Quantiles $\alpha = 0.95, 0.50, 0.05$.]

[R package quantreg: rqss(logratio ~ qss(range, constraint = "N", lambda = 25), tau = 0.5)]