Kernel machines and sparsity

1 Kernel machines and sparsity
ENBIS 09, Saint-Etienne, July 2, 2009
Stéphane Canu & Alain Rakotomamonjy, stephane.canu@litislab.eu

2 Roadmap
1 Introduction (a typical learning problem; kernel machines: a definition)
2 Tools: the functional framework (in the beginning was the kernel; kernel and hypothesis set)
3 Kernel machines and regularization path (non sparse kernel machines; regularization path; piecewise linear regularization path; sparse kernel machines: SVR)
4 Tuning the kernel: MKL (the multiple kernel problem; simpleMKL: the multiple kernel solution)
5 Conclusion

3 Optical character recognition
Example (the MNIST database): data = «image, label» pairs, n = 60,000; d = 700; classes = 10.
Kernel error rate = 0.56 %, best error rate = 0.4 %.

4 Learning challenges: the size effect
Figure: problem size map (number of variables vs. sample size), with example domains: bio computing, text, object recognition, census, MNIST, RCV1, speech translation, scene analysis, geostatistics, Lucy (L. Bottou, 2006).
3 key issues:
1. learn any problem: functional universality
2. from data: statistical consistency
3. with large data sets: computational efficiency
Kernel machines address these three issues (up to a certain point regarding efficiency).

5 Kernel machines
Definition (Kernel machines): given data (x_i, y_i)_{i=1,n},
A(x) = ψ( ∑_{i=1}^n α_i k(x, x_i) + ∑_{j=1}^p β_j q_j(x) )
α and β: parameters to be estimated.
Examples:
splines: A(x) = ∑_{i=1}^n α_i (x − x_i)_+ + β_0 + β_1 x
SVM: A(x) = sign( ∑_{i∈I} α_i exp(−‖x − x_i‖²/b) + β_0 )
exponential family: IP(y|x) = (1/Z) exp( ∑_{i∈I} α_i 1I_{{y = y_i}} (x⊤x_i + b)² )
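
A minimal Python sketch of evaluating such a kernel machine, here the SVM-style example with a Gaussian kernel; the training points, the coefficients α, β_0 and the bandwidth b below are arbitrary toy values, not fitted parameters:

```python
import numpy as np

def gaussian_kernel(x, xi, b):
    """Gaussian kernel exp(-||x - xi||^2 / b) from the SVM example above."""
    return np.exp(-np.sum((x - xi) ** 2) / b)

def decision(x, X_train, alpha, beta0, b):
    """Kernel machine A(x) = sign( sum_i alpha_i k(x, x_i) + beta0 )."""
    s = sum(a * gaussian_kernel(x, xi, b) for a, xi in zip(alpha, X_train))
    return np.sign(s + beta0)

# toy data: alpha, beta0 and b are illustrative values only
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([1.0, -0.5, 0.8])
print(decision(np.array([0.5, 0.5]), X_train, alpha, beta0=0.1, b=1.0))
```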

6 Roadmap
1 Introduction (a typical learning problem; kernel machines: a definition)
2 Tools: the functional framework (in the beginning was the kernel; kernel and hypothesis set)
3 Kernel machines and regularization path (non sparse kernel machines; regularization path; piecewise linear regularization path; sparse kernel machines: SVR)
4 Tuning the kernel: MKL (the multiple kernel problem; simpleMKL: the multiple kernel solution)
5 Conclusion

7 In the beginning was the kernel...
Definition (Kernel): a function of two variables k from X × X to IR.
Definition (Positive kernel): a kernel k(s, t) on X is said to be positive if it is symmetric, k(s, t) = k(t, s), and if for any finite positive integer n:
∀{α_i}_{i=1,n} ∈ IR, ∀{x_i}_{i=1,n} ∈ X, ∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_i, x_j) ≥ 0.
It is strictly positive if, for α_i ≠ 0, ∑_{i=1}^n ∑_{j=1}^n α_i α_j k(x_i, x_j) > 0.

8 Examples of positive kernels
the linear kernel: s, t ∈ IR^d, k(s, t) = s⊤t
symmetric: s⊤t = t⊤s
positive: ∑_i ∑_j α_i α_j k(x_i, x_j) = ∑_i ∑_j α_i α_j x_i⊤x_j = ( ∑_i α_i x_i )⊤( ∑_j α_j x_j ) = ‖ ∑_i α_i x_i ‖² ≥ 0
the product kernel: k(s, t) = g(s)g(t) for some g : IR^d → IR
symmetric by construction
positive: ∑_i ∑_j α_i α_j k(x_i, x_j) = ∑_i ∑_j α_i α_j g(x_i)g(x_j) = ( ∑_i α_i g(x_i) )( ∑_j α_j g(x_j) ) = ( ∑_i α_i g(x_i) )² ≥ 0
k is positive ⇔ its square root exists: k(s, t) = ⟨φ_s, φ_t⟩ (J.P. Vert, 2006)
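
A quick numerical check of these two positivity arguments, on arbitrary toy data: a positive kernel gives a positive semi-definite Gram matrix, so its smallest eigenvalue is non-negative up to numerical tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))           # 20 points in R^5

K_linear = X @ X.T                     # linear kernel k(s, t) = s't
g = np.sin(X).sum(axis=1)              # some arbitrary g: R^d -> R
K_product = np.outer(g, g)             # product kernel k(s, t) = g(s) g(t)

# a positive kernel has a positive semi-definite Gram matrix:
# all eigenvalues >= 0 (up to numerical tolerance)
for name, K in [("linear", K_linear), ("product", K_product)]:
    print(name, np.linalg.eigvalsh(K).min() >= -1e-10)
```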

9 Positive definite kernel (PDK) algebra (closure)
if k_1(s, t) and k_2(s, t) are two positive kernels:
PDK form a convex cone: ∀a_1 ∈ IR+, a_1 k_1(s, t) + k_2(s, t) is a PDK
the product kernel k_1(s, t) k_2(s, t) is a PDK
proofs:
by linearity: ∑_i ∑_j α_i α_j ( a_1 k_1(x_i, x_j) + k_2(x_i, x_j) ) = a_1 ∑_i ∑_j α_i α_j k_1(x_i, x_j) + ∑_i ∑_j α_i α_j k_2(x_i, x_j) ≥ 0
for the product, note that ∑_i ∑_j α_i α_j ψ(x_i)ψ(x_j) = ( ∑_i α_i ψ(x_i) )( ∑_j α_j ψ(x_j) ); assuming ψ_l such that k_1(s, t) = ∑_l ψ_l(s) ψ_l(t):
∑_i ∑_j α_i α_j k_1(x_i, x_j) k_2(x_i, x_j) = ∑_l ∑_i ∑_j ( α_i ψ_l(x_i) ) ( α_j ψ_l(x_j) ) k_2(x_i, x_j) ≥ 0
N. Cristianini and J. Shawe-Taylor, Kernel Methods for Pattern Analysis, 2004

10 Kernel engineering: building PDK
for any polynomial φ with positive coefficients, from IR to IR: φ( k(s, t) )
if Ψ is a function from IR^d to IR^d: k( Ψ(s), Ψ(t) )
if ϕ, from IR^d to IR+, has its minimum in 0: k(s, t) = ϕ(s + t) − ϕ(s − t)
the convolution of two positive kernels is a positive kernel: K_1 ⋆ K_2
the Gaussian kernel is a PDK:
exp(−‖s − t‖²) = exp(−‖s‖² − ‖t‖² + 2 s⊤t) = exp(−‖s‖²) exp(−‖t‖²) exp(2 s⊤t)
s⊤t is a PDK, and exp is the limit of a positive series expansion, so exp(2 s⊤t) is a PDK
exp(−‖s‖²) exp(−‖t‖²) is a PDK as a product kernel
the product of two PDK is a PDK
O. Catoni, master lecture, 2005

11 Some examples of PD kernels
type        name          k(s, t)
radial      Gaussian      exp(−r²/b),  r = ‖s − t‖
radial      Laplacian     exp(−r/b)
radial      rational      1 − r²/(r² + b)
radial      loc. Gauss.   max(0, 1 − r/(3b))^d exp(−r²/b)
non stat.   χ²            exp(−r/b),  r = ∑_k (s_k − t_k)²/(s_k + t_k)
projective  polynomial    (s⊤t)^p
projective  affine        (s⊤t + b)^p
projective  cosine        s⊤t / (‖s‖ ‖t‖)
projective  correlation   exp( s⊤t / (‖s‖ ‖t‖) − b )
Most of these kernels depend on a quantity b called the bandwidth.
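
A small sketch computing a few of these Gram matrices with numpy/scipy; the bandwidth b and degree p are illustrative values:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gram(X, kind, b=1.0, p=2):
    """Gram matrices for a few kernels from the table (r = ||s - t||)."""
    if kind == "gaussian":
        return np.exp(-cdist(X, X, "sqeuclidean") / b)
    if kind == "laplacian":
        return np.exp(-cdist(X, X, "euclidean") / b)
    if kind == "polynomial":
        return (X @ X.T) ** p
    if kind == "affine":
        return (X @ X.T + b) ** p
    raise ValueError(kind)

X = np.random.default_rng(0).normal(size=(10, 3))
print(gram(X, "gaussian", b=2.0).shape)   # (10, 10)
```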

12 Kernels for objects and structures
kernels on histograms and probability distributions: k(p, q) = ∫ k( p(x), q(x) ) dIP(x)
kernels on strings: spectral string kernel using sub-sequences, k(s, t) = ∑_u φ_u(s) φ_u(t); similarities by alignments, k(s, t) = ∑_π exp( β(s, t, π) )
kernels on graphs: the pseudo-inverse of the (regularized) graph Laplacian L = D − A (A the adjacency matrix, D the degree matrix); diffusion kernels (1/Z(b)) exp(bL); subgraph kernel by convolution (using random walks)
and kernels on heterogeneous data (images), HMM, automata...
Shawe-Taylor & Cristianini's book, 2004; J.P. Vert, 2006

13 Roadmap
1 Introduction (a typical learning problem; kernel machines: a definition)
2 Tools: the functional framework (in the beginning was the kernel; kernel and hypothesis set)
3 Kernel machines and regularization path (non sparse kernel machines; regularization path; piecewise linear regularization path; sparse kernel machines: SVR)
4 Tuning the kernel: MKL (the multiple kernel problem; simpleMKL: the multiple kernel solution)
5 Conclusion

14 From kernel to functions
H_0 = { f | m_f < ∞; f_j ∈ IR; t_j ∈ X, f(x) = ∑_{j=1}^{m_f} f_j k(x, t_j) }
define the bilinear form (with g(x) = ∑_{i=1}^{m_g} g_i k(x, s_i)): ∀f, g ∈ H_0, ⟨f, g⟩_{H_0} = ∑_{j=1}^{m_f} ∑_{i=1}^{m_g} f_j g_i k(t_j, s_i)
evaluation functional: ∀x ∈ X, f(x) = ⟨f(·), k(x, ·)⟩_{H_0}
from k to H: with any positive kernel, a hypothesis set H (the closure of H_0) can be constructed, together with its metric.

15 RKHS
Definition (reproducing kernel Hilbert space (RKHS)): a Hilbert space H endowed with the inner product ⟨·, ·⟩_H is said to be with reproducing kernel if there exists a positive kernel k such that ∀s ∈ X, k(·, s) ∈ H and ∀f ∈ H, f(s) = ⟨f(·), k(s, ·)⟩_H.
positive kernel ⇔ RKHS
any function is pointwise defined
the kernel defines the inner product
it defines the regularity (smoothness) of the hypothesis set

16 Functional differentiation in RKHS
Let J be a functional: J : H → IR, f ↦ J(f); examples: J_1(f) = ‖f‖²_H, J_2(f) = f(x)
directional derivative in direction g at point f: dJ(f, g) = lim_{ε→0} ( J(f + εg) − J(f) ) / ε
gradient ∇J(f): ∇J : H → H, f ↦ ∇J(f), such that dJ(f, g) = ⟨∇J(f), g⟩_H
exercise: find ∇J_1(f) and ∇J_2(f)
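
A short worked solution of the exercise, using only the directional derivative and the reproducing property:

```latex
% gradient of J_1(f) = ||f||_H^2 and J_2(f) = f(x) in the RKHS H
\[
dJ_1(f,g) = \lim_{\varepsilon \to 0}
\frac{\|f+\varepsilon g\|_H^2 - \|f\|_H^2}{\varepsilon}
= 2\langle f, g\rangle_H
\quad\Longrightarrow\quad \nabla J_1(f) = 2f,
\]
\[
dJ_2(f,g) = \lim_{\varepsilon \to 0}
\frac{f(x)+\varepsilon g(x) - f(x)}{\varepsilon}
= g(x) = \langle g, k(x,\cdot)\rangle_H
\quad\Longrightarrow\quad \nabla J_2(f) = k(x,\cdot).
\]
```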

17 Other kernels (what really matters)
finite kernels: k(s, t) = ( φ_1(s), ..., φ_p(s) )⊤( φ_1(t), ..., φ_p(t) )
Mercer kernels: positive on a compact set, k(s, t) = ∑_{j=1}^p λ_j φ_j(s) φ_j(t)
positive kernels: positive semi-definite
conditionally positive (for some functions p_j): ∀{x_i}_{i=1,n}, ∀α_i such that ∑_i α_i p_j(x_i) = 0, j = 1, ..., p: ∑_i ∑_j α_i α_j k(x_i, x_j) ≥ 0
symmetric non positive: k(s, t) = tanh(s⊤t + α_0)
non symmetric non positive
the key property ∇J_t(f) = k(t, ·) holds
C. Ong et al., ICML, 2004

18 Let's summarize
positive kernels ⇔ RKHS = H ⇔ regularity ‖f‖²_H
the key property ∇J_t(f) = k(t, ·) holds, and not only for positive kernels
f(x_i) exists (pointwise defined functions)
universal consistency in RKHS
the Gram matrix summarizes the pairwise comparisons

19 Plan
1 Introduction (a typical learning problem; kernel machines: a definition)
2 Tools: the functional framework (in the beginning was the kernel; kernel and hypothesis set)
3 Kernel machines and regularization path (non sparse kernel machines; regularization path; piecewise linear regularization path; sparse kernel machines: SVR)
4 Tuning the kernel: MKL (the multiple kernel problem; simpleMKL: the multiple kernel solution)
5 Conclusion

20 Interpolation splines
find f ∈ H such that f(x_i) = y_i, i = 1, ..., n
it is an ill-posed problem

21 Interpolation splines: minimum norm interpolation
min_{f∈H} 1/2 ‖f‖²_H  such that  f(x_i) = y_i, i = 1, ..., n
the Lagrangian (α_i: Lagrange multipliers): L(f, α) = 1/2 ‖f‖² − ∑_{i=1}^n α_i ( f(x_i) − y_i )
optimality for f: ∇_f L(f, α) = 0 ⇒ f(x) = ∑_{i=1}^n α_i k(x_i, x)
dual formulation (remove f from the Lagrangian): Q(α) = −1/2 ∑_i ∑_j α_i α_j k(x_i, x_j) + ∑_i α_i y_i
solution: max_{α∈IR^n} Q(α) ⇔ Kα = y

24 Representer theorem
Theorem (Representer theorem): let H be a RKHS with kernel k(s, t), let ℓ be a loss function from X to IR and Φ a non-decreasing function from IR to IR. If there exists a function f minimizing
f = argmin_{f∈H} ∑_{i=1}^n ℓ( y_i, f(x_i) ) + Φ( ‖f‖²_H )
then there exists a vector α ∈ IR^n such that f(x) = ∑_{i=1}^n α_i k(x, x_i).
It can be generalized to the semi-parametric case: f(x) = ∑_{i=1}^n α_i k(x, x_i) + ∑_{j=1}^m β_j φ_j(x).

25 Smoothing splines
introducing the error (the slack) ξ_i = f(x_i) − y_i:
(S) min_{f∈H} 1/2 ‖f‖²_H + 1/(2λ) ∑_{i=1}^n ξ_i²  such that  f(x_i) = y_i + ξ_i, i = 1, ..., n
three equivalent definitions:
(S') min_{f∈H} 1/2 ∑_{i=1}^n ( f(x_i) − y_i )² + λ/2 ‖f‖²_H
min_{f∈H} 1/2 ‖f‖²_H  such that  ∑_{i=1}^n ( f(x_i) − y_i )² ≤ C
min_{f∈H} ∑_{i=1}^n ( f(x_i) − y_i )²  such that  ‖f‖²_H ≤ C
using the representer theorem, (S') becomes min_{α∈IR^n} 1/2 ‖Kα − y‖² + λ/2 α⊤Kα ⇒ (K + λI)α = y
using min_{α∈IR^n} 1/2 ‖Kα − y‖² + λ/2 α⊤α instead ⇒ α = (K⊤K + λI)^{-1} K⊤y
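
A numpy sketch of the (K + λI)α = y solution on toy one-dimensional data (Gaussian kernel, illustrative bandwidth and λ); setting λ = 0 would give back the interpolation spline whenever K is invertible:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_gram(X, Z, b=1.0):
    """Gaussian Gram matrix k(x, z) = exp(-||x - z||^2 / b)."""
    return np.exp(-cdist(X, Z, "sqeuclidean") / b)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=30)

lam = 1e-2                                              # regularization parameter lambda
K = gaussian_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)    # (K + lambda I) alpha = y

X_test = np.linspace(0, 1, 5).reshape(-1, 1)
f_test = gaussian_gram(X_test, X) @ alpha               # f(x) = sum_i alpha_i k(x, x_i)
print(f_test)
```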

26 From 0 to interpolation
problem: min_{f∈H} 1/2 ∑_{i=1}^n ( f(x_i) − y_i )² + λ/2 ‖f‖²_H
solution: α(λ) = (K + λI)^{-1} y
λ = 0: α(0) = K^{-1} y (interpolation); λ = ∞: α(∞) = 0
Regularization path: the set of solutions as a function of λ, S = { α(λ) | λ ∈ [0, ∞) }, also called the solution path.

27 1D ridge regression in the costs domain
the loss term L as a function of the penalty term P:
min_{α∈IR} ∑_{i=1}^n (x_i α − y_i)² + λ α²
L(α) = ∑_{i=1}^n (x_i α − y_i)²,  P(α) = α²
L(P) = aP ± b√P + c, with a, b and c ∈ IR
Figure: regularization path as a function of the criteria L and P.

28 How to tune the regularization parameter λ?
brute force: for each λ_1 < λ_2 < ... < λ_k < ... < λ_K compute α_k = (K + λ_k I)^{-1} y, k = 1, ..., K — cost O(Kn³)
warm start: α_k = Φ(α_{k−1}) (using ℓ conjugate gradient iterations) — cost O(Kℓn²)
warm start + prediction step: α_k^(p) = α_{k−1} + ρ ∇_α( L(α_{k−1}) + λ_k P(α_{k−1}) ) (prediction), α_k = Φ(α_k^(p)) (correction step using CG) — cost O(Kℓ'n²)
use only the prediction step: α_k = α_{k−1} + λ_k Ψ(α_{k−1}) (prediction step) — cost O(Kn²); to do so the regularization path has to be piecewise linear
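
A sketch of the warm-start strategy with scipy's conjugate gradient solver: the solution at λ_{k−1} initializes the solve at λ_k; the grid, kernel and data below are toy choices:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
K = np.exp(-(X - X.T) ** 2)                     # Gaussian Gram matrix
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

lambdas = np.logspace(1, -3, 20)                # lambda_1 > ... > lambda_K
alpha = np.zeros_like(y)                        # start from the large-lambda solution
path = []
for lam in lambdas:
    # warm start: the previous alpha initializes the CG solve of (K + lam I) alpha = y
    alpha, info = cg(K + lam * np.eye(len(y)), y, x0=alpha)
    path.append(alpha.copy())
print(len(path), path[-1][:3])
```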

32 How to choose L and P to get a linear reg. path?
min_{α∈IR^d} L(α) + λ P(α)
the solution path is linear ⇔ one cost is piecewise quadratic and the other one is piecewise linear
1. piecewise linearity: lim_{ε→0} ( α(λ + ε) − α(λ) ) / ε = constant
2. optimality: ∇L(α(λ)) + λ ∇P(α(λ)) = 0 and ∇L(α(λ + ε)) + (λ + ε) ∇P(α(λ + ε)) = 0
3. use a Taylor expansion (convex case [Rosset & Zhu, 07]):
lim_{ε→0} ( α(λ + ε) − α(λ) ) / ε = −[ ∇²L(α(λ)) + λ ∇²P(α(λ)) ]^{-1} ∇P(α(λ))
this is constant when ∇²L(α(λ)) = constant and ∇²P(α(λ)) = 0

33 Standard formulation
portfolio optimization (Markowitz, 1952): return vs. risk
min_α 1/2 α⊤Qα  with  e⊤α = C
efficiency frontier: piecewise linear (critical path algorithm)
sensitivity analysis, standard formulation (Heller, 1954):
min_α 1/2 α⊤Qα + (c + λ Δc)⊤α  with  Aα = b + µ Δb
parametric programming (see T. Gal's book, 1968): in the general case of a PQP, the reg. path is piecewise linear
PLP and multi-parametric programming

34 Piecewise linear regularization path algorithms
L    P    regression     classification   clustering
L2   L1   Lasso/LARS     L1 L2-SVM        L1 PCA/SVD
L1   L2   SVR            SVM              OC-SVM
L1   L1   L1-LAD         L1-SVM           Dantzig Selector
Table: examples of piecewise linear regularization path algorithms.
P: L_p = ∑_{j=1}^d |β_j|^p
L: L_p: |f(x) − y|^p; hinge: (1 − y f(x))_+^p
ε-insensitive: 0 if |f(x) − y| < ε, |f(x) − y| − ε otherwise
Huber's loss: |f(x) − y|² if |f(x) − y| < t, 2t|f(x) − y| − t² otherwise
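
For the Lasso/LARS cell of this table, scikit-learn's lars_path returns exactly such a piecewise linear path: the breakpoints of the regularization parameter plus the coefficients at each breakpoint (toy data below):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=100)

# LARS computes the full Lasso path; `alphas` are the values of the
# regularization parameter where the active set changes, and the
# coefficients are linear in between (piecewise linear path)
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas.shape, coefs.shape)   # one column of coefs per breakpoint
```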

35 The world is changing: "The Gaussian Hare and the Laplacian Tortoise: Computability of L1 vs. L2 Regression Estimators", Portnoy & Koenker, 1997

36 The consequence of having a useful reg. path
min_{α∈IR^d} L(α) + λ P(α)  →  { α(λ), λ ∈ [0, ∞] }
efficient computing ⇔ piecewise linearity: α_NEW = α_OLD + (λ_NEW − λ_OLD) u
L1 criteria ⇒ either L or P is L1 ⇒ sparsity: a lot of α_j = 0
sparsity and active constraints
why does L1 provide sparsity?

37 Definition: strong homogeneity
set (of variables) I_0 = { j ∈ {1, ..., d} | α_j = 0 }
Theorem (Nikolova, 2000)
Regular case: if L(α) + λP(α) is differentiable and I_0(y) ≠ ∅, then ∀ε > 0 there exist y' ∈ B(y, ε) such that I_0(y') ≠ I_0(y)
Singular case: if L(α) + λP(α) is NON differentiable and I_0(y) ≠ ∅, then ∃ε > 0 such that ∀y' ∈ B(y, ε), I_0(y') = I_0(y)
singular criteria ⇒ sparsity
L1 criteria are singular at 0: singularity provides sparsity
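
A one-dimensional illustration of why the singularity at 0 yields exact zeros, taking L(α) = 1/2 (α − y)² as the data term (an added worked example):

```latex
% L1 penalty (singular at 0) vs. L2 penalty (differentiable at 0)
\[
\min_{\alpha}\ \tfrac12(\alpha-y)^2 + \lambda|\alpha|
\;\Longrightarrow\;
\hat\alpha = \operatorname{sign}(y)\,\max(|y|-\lambda,\,0)
\quad\text{(exactly } 0 \text{ whenever } |y|\le\lambda\text{)},
\]
\[
\min_{\alpha}\ \tfrac12(\alpha-y)^2 + \lambda\alpha^2
\;\Longrightarrow\;
\hat\alpha = \frac{y}{1+2\lambda}
\quad\text{(never exactly } 0 \text{ for } y\neq 0\text{)}.
\]
```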

38 L1 splines: introducing sparsity
the L1 error:
(S1) min_{f∈H} 1/2 ‖f‖²_H + (1/λ) ∑_{i=1}^n |ξ_i|  such that  f(x_i) = y_i + ξ_i, i = 1, ..., n
representer theorem: f(x) = ∑_{i=1}^n (α_i^+ − α_i^−) k(x, x_i)
the dual:
(D1) min_α 1/2 (α^+ − α^−)⊤ K (α^+ − α^−) − (α^+ − α^−)⊤ y  such that  0 ≤ α_i^+ ≤ 1/λ, 0 ≤ α_i^− ≤ 1/λ, i = 1, ..., n
a typical parametric quadratic program (pQP), but α_i ≠ 0

39 K-Lasso (kernel basis pursuit)
the kernel Lasso:
(S1) min_{α∈IR^n} 1/2 ‖Kα − y‖² + λ ∑_{i=1}^n |α_i|
a typical parametric quadratic program (pQP) with α_i = 0 (sparsity); piecewise linear regularization path
the dual:
(D1) min_α 1/2 ‖Kα‖²  such that  ‖K⊤(Kα − y)‖_∞ ≤ t
the K-Dantzig selector can be treated the same way
requires computing K⊤K — no more function f!
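
A sketch of the K-Lasso using scikit-learn's Lasso on the Gram matrix as design matrix; note that sklearn minimizes (1/2n)‖y − Kα‖² + λ‖α‖_1, i.e. the same problem up to a 1/n rescaling of λ; the data and bandwidth are toy choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=100)
K = np.exp(-(X - X.T) ** 2 / 0.1)      # Gaussian Gram matrix used as design matrix

# Lasso on the columns of K: many alpha_i end up exactly zero (sparsity)
model = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10000).fit(K, y)
print("non-zero alphas:", np.sum(model.coef_ != 0), "out of", len(y))
```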

40 Support vector regression (SVR)
Lasso's dual adaptation:
{ min_α 1/2 ‖Kα‖²  s.t.  ‖K⊤(Kα − y)‖_∞ ≤ t }  →  { min_{f∈H} 1/2 ‖f‖²_H  s.t.  |f(x_i) − y_i| ≤ t, i = 1, ..., n }
the support vector regression introduces slack variables:
(SVR) min_{f∈H} 1/2 ‖f‖²_H + C ∑_i ξ_i  such that  |f(x_i) − y_i| ≤ t + ξ_i, 0 ≤ ξ_i, i = 1, ..., n
a typical multi-parametric quadratic program (mpQP)
piecewise linear regularization path: α(C, t) = α(C_0, t_0) + (1/C − 1/C_0) u + (1/C_0)(t − t_0) v
a 2d Pareto front (the tube width and the regularity)
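
A minimal example with scikit-learn's SVR, where C plays the role of the regularity trade-off and epsilon the role of the tube width t; the data and hyper-parameter values are illustrative:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)

# C: regularity trade-off; epsilon: tube width t of the (SVR) program above
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(model.support_), "out of", len(y))
```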

41 Support vector regression illustration
Figure: support vector machine regression fits for a large C (left) and a small C (right).
there exist other formulations such as LP SVR...

42 Large scale pQP
not yet adapted: reweighted LS, interior points, projected gradient
adapted: homotopy (regularization path) and other pQP, active set, decomposition, coordinate-wise (Gauss-Seidel), other: cutting plane, proximal...

43 Checker board: 2 classes, 500 examples, separable

44 A separable case. Figures: n = 5000 data points and n = 500 data points.

45 Empirical complexity
Results for C = 1, C = 1000 and C = … (left: γ = 1, right: γ = 0.3).
Figures: training time (CPU seconds, log scale), number of support vectors (log scale) and error rate in % over 2000 unseen points, as functions of the training size (log scale), for CVM, LibSVM and SimpleSVM.
G. Loosli et al., JMLR, 2007

46 Plan
1 Introduction (a typical learning problem; kernel machines: a definition)
2 Tools: the functional framework (in the beginning was the kernel; kernel and hypothesis set)
3 Kernel machines and regularization path (non sparse kernel machines; regularization path; piecewise linear regularization path; sparse kernel machines: SVR)
4 Tuning the kernel: MKL (the multiple kernel problem; simpleMKL: the multiple kernel solution)
5 Conclusion

47 Multiple kernel
the model: f(x) = ∑_{i=1}^l α_i K(x, x_i) + b
Given M kernel functions K_1, ..., K_M that are potentially well suited for a given problem, find a positive linear combination of these kernels such that the resulting kernel K is optimal:
K(x, x') = ∑_{m=1}^M d_m K_m(x, x'),  with d_m ≥ 0, ∑_m d_m = 1
The kernel coefficients d_m and the SVR parameters α_i, b need to be learned together.
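
A small numpy sketch of building the combined Gram matrix K = ∑_m d_m K_m from a dictionary of candidate kernels; the bandwidths and the uniform weights are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# a small dictionary of candidate kernels K_1, ..., K_M (bandwidths are illustrative)
D2 = cdist(X, X, "sqeuclidean")
kernels = [np.exp(-D2 / b) for b in (0.1, 1.0, 10.0)] + [X @ X.T]

# convex combination K = sum_m d_m K_m with d_m >= 0, sum_m d_m = 1
d = np.ones(len(kernels)) / len(kernels)
K = sum(dm * Km for dm, Km in zip(d, kernels))
print(K.shape, np.allclose(K, K.T))
```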

48 Multiple kernel functional learning
the problem (for given C and t):
min_{{f_m}, b, ξ, d} 1/2 ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i
s.t.  |∑_m f_m(x_i) + b − y_i| ≤ t + ξ_i, ξ_i ≥ 0 ∀i,  ∑_m d_m = 1, d_m ≥ 0 ∀m
regularization formulation:
min_{{f_m}, b, d} 1/2 ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i max( |∑_m f_m(x_i) + b − y_i| − t, 0 ),  ∑_m d_m = 1, d_m ≥ 0 ∀m
equivalently:
min ∑_i max( |∑_m f_m(x_i) + b − y_i| − t, 0 ) + C ∑_m (1/d_m) ‖f_m‖²_{H_m} + µ ∑_m d_m

49 Multiple kernel functional learning
the problem (for given C and t):
min_{{f_m}, b, ξ, d} 1/2 ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i
s.t.  |∑_m f_m(x_i) + b − y_i| ≤ t + ξ_i, ξ_i ≥ 0 ∀i,  ∑_m d_m = 1, d_m ≥ 0 ∀m
treated as a bi-level optimization task:
min_{d∈IR^M} J(d)  s.t.  ∑_m d_m = 1, d_m ≥ 0 ∀m
where J(d) = min_{{f_m}, b, ξ} 1/2 ∑_m (1/d_m) ‖f_m‖²_{H_m} + C ∑_i ξ_i  s.t.  |∑_m f_m(x_i) + b − y_i| ≤ t + ξ_i, ξ_i ≥ 0 ∀i

50 Multiple kernel algorithm
use a reduced gradient algorithm [Rakotomamonjy et al., JMLR 08]:
min_{d∈IR^M} J(d)  s.t.  ∑_m d_m = 1, d_m ≥ 0 ∀m
SimpleMKL algorithm:
set d_m = 1/M for m = 1, ..., M
while stopping criterion not met do
  compute J(d) using a QP solver with K = ∑_m d_m K_m
  compute ∂J/∂d_m and the descent direction D
  compute the optimal stepsize γ
  d ← d + γ D
end while
Recent improvement reported using the Hessian.
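
A rough Python sketch of this loop, with two simplifications that depart from SimpleMKL itself: the inner problem is a classification SVM (scikit-learn's SVC with a precomputed kernel) rather than the SVR above, and the reduced-gradient step with line search is replaced by a crude projected-gradient update; the gradient uses the standard identity ∂J/∂d_m = −1/2 β⊤ K_m β with β_i = y_i α_i on the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

def mkl_sketch(kernels, y, C=1.0, step=0.05, n_iter=20):
    """Rough sketch of the SimpleMKL loop (classification case, projected
    gradient instead of the reduced gradient of the original algorithm)."""
    M = len(kernels)
    d = np.ones(M) / M                                   # d_m = 1/M
    for _ in range(n_iter):
        K = sum(dm * Km for dm, Km in zip(d, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)   # inner QP solver
        sv, beta = svm.support_, svm.dual_coef_[0]       # beta_i = y_i alpha_i
        # dJ/dd_m = -1/2 beta' K_m[sv, sv] beta
        grad = np.array([-0.5 * beta @ Km[np.ix_(sv, sv)] @ beta for Km in kernels])
        d = np.clip(d - step * grad, 0, None)            # descent step
        d /= d.sum()                                      # crude projection on the simplex
    return d

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
kernels = [np.exp(-D2 / b) for b in (0.5, 2.0, 8.0)]
print(mkl_sketch(kernels, y))
```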

51 Complexity
for each iteration:
SVM training: O(n·n_sv + n³_sv); inverting K_{sv,sv} is O(n³_sv), but it might already be available as a by-product of the SVM training
computing H: O(M n²_sv)
finding d: O(M³)
the number of iterations is usually less than 10
when M < n_sv, computing d is not more expensive than the QP

52 Multiple kernel experiments
Figures: LinChirp, Wave, Blocks and Spikes test signals.
Table: normalized mean square error (%) and number of kernels, averaged over 20 runs, for the Single Kernel, Kernel Dil and Kernel Dil-Trans settings on the LinChirp, Wave, Blocks and Spike data sets (single-kernel normalized MSE: LinChirp 1.46, Wave 0.98, Blocks 1.96, Spike 6.85).

54 Conclusion
kernels, sparsity, L1, efficient algorithms
some limits, and ways to address them:
instability: coupling with active set methods
large data sets: randomize
when to stop: derive relevant bounds
non convexity: iterative L1
Perspectives: more algorithms, more criteria, more applications

55 Questions?
LAR(s) (and svmpath): R, www-stat.stanford.edu/~hastie; Matlab, asi.insa-rouen.fr/~vguigue/lars.html
SVR: kernlab, cran.r-project.org/src/contrib/descriptions/kernlab.html; Matlab, asi.insa-rouen.fr/~arakotom
Dantzig selector

56 Bibliography
Least angle regression, Bradley Efron et al., Annals of Statistics, vol. 32(2), 2004.
Piecewise linear regularized solution paths, Saharon Rosset, Ji Zhu, Annals of Statistics, to appear, 2007.
Local strong homogeneity of a regularized estimator, Mila Nikolova, SIAM Journal on Applied Mathematics, vol. 61, no. 2, 2000.
Two-dimensional solution path for support vector regression, Gang Wang, Dit-Yan Yeung, Frederick H. Lochovsky, ICML, 2006.
Algorithmic linear dimension reduction in the l1 norm for sparse vectors, A. C. Gilbert, M. J. Strauss, J. A. Tropp, R. Vershynin, 2006 (submitted).
A tutorial on support vector regression, A. Smola and B. Schölkopf, Statistics and Computing, 14(3), 2004.
The Dantzig selector: statistical estimation when p is much larger than n, Emmanuel Candès and Terence Tao, submitted to IEEE Transactions on Information Theory.
An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, I. Daubechies, M. Defrise and C. De Mol, Comm. Pure Appl. Math., 57, 2004.
Pathwise coordinate optimization, Jerome Friedman, Trevor Hastie and Robert Tibshirani, technical report, May 2007.
