RegML 2018 Class 2 Tikhonov regularization and kernels

Size: px

Start display at page:

Download "RegML 2018 Class 2 Tikhonov regularization and kernels"

Emery Rice
5 years ago
Views:

1 RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018

2 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i, y i ) n (ρ, fixed, unknown).

3 Empirical Risk Minimization (ERM) min f H E(f) min Ê(f) f H proxy to E Ê(f) = 1 n L(f(x i ), y i )

4 From ERM to regularization ERM can be a bad idea if n is small and H is big Regularization min Ê(f) min f H f H Ê(f) + λ R(f) }{{} regularization λ regularization parameter

5 Examples of regularizers Let p f(x) = φ j (x)w j j=1 l 2 p R(f) = w 2 = w j 2, j=1 l 1 p R(f) = w 1 = w j, j=1 Differential operators R(f) = f(x) 2 dρ(x),... X

6 From statistics to optimization Problem Solve min Ê(w) + λ w 2 w Rp with Ê(w) = 1 n L(w x i, y i ).

7 Minimization min Ê(w) + λ w 2 w Strongly convex functional Computations depends on the considered function

8 Logistic regression Ê λ (w) = 1 log(1 + e yiw x i ) + λ w 2. n }{{} smooth and strongly convex Êλ(w) = 1 n x i y i 1 + e yiw x i + 2λw

9 Gradient descent Let F : R d R differentiable, (strictly) convex and such that F (w) F (w ) L w w (e.g. sup w H(w) L) }{{} hessian Then w 0 = 0, w t+1 = w t 1 L F (w t), converges to the minimizer of F.

10 Gradient descent for LR 1 min w R p n log(1 + e yiw x i ) + λ w 2 Consider w t+1 = w t 1 L [ 1 n x i y i 1 + e yiw t xi + 2λw t ]

11 Complexity Logistic: O(ndT ) n number of examples, d dimensionality, T number of steps What if n d? Can we get better complexities?

12 Representer theorems Idea Show that f(x) = w x = x i xc i, c i R. i.e. w = n x ic i. Compute c = (c 1,..., c n ) R n rather than w R d.

13 Representer theorem for GD & LR By induction c t+1 = c t 1 L [ 1 n e i y i 1 + e yift(xi) + 2λc t ] with e i the i-th element of the canonical basis and f t (x) = x x i (c t ) i

14 Non-linear features d p f(x) = w i x i f(x) = w i φ i (x i ). Model Φ(x) = (φ 1 (x),..., φ p (x)),

15 Non-linear features d p f(x) = w i x i f(x) = w i φ i (x i ). Computations Same up-to the change Φ(x) = (φ 1 (x),..., φ p (x)), x Φ(x)

16 Representer theorem with non-linear features f(x) = x i xc i f(x) = Φ(x i ) Φ(x)c i

17 Rewriting logistic regression and gradient descent By induction c t+1 = c t 1 L [ 1 n e i y i 1 + e yift(xi) + 2λc t ] with e i the i-th element of the canonical basis and f t (x) = Φ(x) Φ(x) i (c t ) i

18 Hinge loss and support vector machines Ê λ (w) = 1 1 y i w x i + + λ w 2 n }{{} non-smooth & strongly-convex Consider left derivative S i (w) = w t+1 = w t 1 L t ( 1 n ) S i (w t ) + 2λw t { y i x i if y i w x i 1, L = sup x + 2λ. 0 otherwise x X

19 Subgradient Let F : R p R convex, Subgradient F (w 0 ) set of vectors v R p such that, for every w R p F (w) F (w 0 ) (w w 0 ) v In one dimension F (w 0 ) = [F (w 0 ), F +(w 0 )].

20 Subgradient method Let F : R p R strictly convex, with bounded subdifferential, and γ t = 1/t then, w t+1 = w t γ t v t with v t F (w t ) converges to the minimizer of F.

21 Subgradient method for SVM 1 min w R p n 1 y i w x i + + λ w 2 Consider w t+1 = w t 1 t ( 1 n ) S i (w t ) + 2λw t S i (w t ) = { y i x i if y i w x i 1 0 otherwise

22 Representer theorem of SVM By induction c t+1 = c t 1 t ( 1 n ) S i (c t ) + 2λc t with e i the i-th element of the canonical basis, f t (x) = x x i (c t ) i and S i (c t ) = { y i e i if y i f t (x i ) < 1. 0 otherwise

23 Rewriting SVM by subgradient By induction c t+1 = c t 1 t ( 1 n ) S i (c t ) + 2λc t with e i the i-th element of the canonical basis, f t (x) = Φ(x) Φ(x) i (c t ) i and S i (c t ) = { y i e i if y i f t (x i ) < 1. 0 otherwise

24 Optimality condition for SVM Smooth Convex F (w ) = 0 Non-smooth Convex 0 F (w) 0 F (w ) 0 1 y i w x i + + λ2w w 1 2λ 1 y iw x i +.

25 Optimality condition for SVM (cont.) The optimality condition can be rewritten as 0 = 1 n ( y i x i c i ) + 2λw w = where c i = c i (w) [V ( y i w x i ), V + ( y i w x i )]. x i ( y ic i 2λn ). A direct computation gives c i = 1 if yf(x i ) < 1 0 c i 1 if yf(x i ) = 1 c i = 0 if yf(x i ) > 1

26 Support vectors c i = 1 if yf(x i ) < 1 0 c i 1 if yf(x i ) = 1 c i = 0 if yf(x i ) > 1

27 Complexity Without representer Logistic: O(ndT ) SVM: O(ndT ) With representer Logistic: O(n 2 (d + T )) SVM: O(n 2 (d + T )) n number of example, d dimensionality, T number of steps

28 Are loss functions all the same? min Ê(w) + λ w 2 w each loss has a different target function and different computations The choice of the loss is problem dependent

29 So far regularization by penalization iterative optimization linear/non-linear parametric models What about nonparametric models?

30 From features to kernels f(x) = x i xc i f(x) = Φ(x i ) Φ(x)c i Kernels Φ(x) Φ(x ) K(x, x ) f(x) = K(x i, x)c i

31 LR and SVM with kernels As before: LR SVM But now c t+1 = c t 1 L [ c t+1 = c t 1 t f t (x) = 1 n ( 1 n e i y i 1 + e yift(xi) + 2λc t ) S i (c t ) + 2λc t K(x, x i )(c t ) i ]

32 Examples of kernels Linear K(x, x ) = x x Polynomial K(x, x ) = (1 + x x) p, with p N Gaussian K(x, x ) = e γ x x 2, with γ > 0 f(x) = c i K(x i, x)

33 Kernel engineering Kernels for Strings, Graphs, Histograms, Sets,...

34 What is a kernel? K(x, x ) Similarity measure Inner product Positive definite function

35 Positive definite function K : X X R is positive definite, when for any n N, x 1,..., x n X, let K n such that K n R n n, (K n ) ij = K(x i, x j ), then K n is positive semidefinite, (eigenvalues 0)

36 PD functions and RKHS Each PD Kernel defines a function space called Reproducing kernel Hilbert space (RKHS) H = span {K(, x) x X}.

37 Nonparametrics and kernels Number of parameters automatically determined by number of points f(x) = K(x i, x)c i Compare to p f(x) = φ j (x)w j j=1

38 This class Learning and Regularization: logistic regression and SVM Optimization with first order methods Linear and Non-linear parametric models Non-parametric models and kernels

39 Next class Beyond penalization Regularization by subsampling stochastic projection

Oslo Class 2 Tikhonov regularization and kernels

RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n