RegML 2017 @ SIMULA Oslo
Class 2: Tikhonov regularization and kernels
Lorenzo Rosasco (UNIGE-MIT-IIT)
May 3, 2017
Learning problem

Problem: for H ⊂ {f | f : X → Y}, solve

  min_{f ∈ H} E(f),    E(f) = ∫ l(f(x), y) dρ(x, y),

given only the sample S_n = (x_i, y_i)_{i=1}^n (ρ is fixed but unknown).

How can we design reliable algorithms?
This class

- Regularization by penalization
- Logistic regression
- Representer theorems and kernels
- Support vector machines
Empirical Risk Minimization (ERM)

  min_{f ∈ H} E(f)   ⇝   min_{f ∈ H} Ê(f)   (proxy to E)

with the empirical risk

  Ê(f) = (1/n) Σ_{i=1}^n l(f(x_i), y_i).
From ERM to regularization

ERM can be a bad idea if n is small and H is big.

Penalization:

  min_{f ∈ H} Ê(f)   ⇝   min_{f ∈ H} Ê(f) + λ R(f),

where λ R(f) is the regularization term and λ the regularization parameter.
Examples of regularizers

Let f(x) = Σ_{j=1}^p φ_j(x) w_j.

- l2:  R(f) = ‖w‖² = Σ_{j=1}^p w_j²
- l1:  R(f) = ‖w‖_1 = Σ_{j=1}^p |w_j|
- Differential operators:  R(f) = ∫_X ‖∇f(x)‖² dρ(x), ...
From statistics to optimization

Problem: solve

  min_{w ∈ R^p} Ê(w) + λ ‖w‖²,    with Ê(w) = (1/n) Σ_{i=1}^n l(w^T x_i, y_i).
Minimization

  min_{w} Ê(w) + λ ‖w‖²

- Strongly convex, continuous, coercive functional
- Computations depend on the considered loss function
Logistic loss

[Plot: comparison of the 0-1 loss, square loss, hinge loss, and logistic loss]

  l(f(x), y) = log(1 + e^{-y f(x)})

Linear and non-linear models; solution by gradient descent.
Logistic regression

  Ê_λ(w) = (1/n) Σ_{i=1}^n log(1 + e^{-y_i w^T x_i}) + λ ‖w‖²    (smooth and strongly convex)

  ∇Ê_λ(w) = -(1/n) Σ_{i=1}^n x_i y_i / (1 + e^{y_i w^T x_i}) + 2λ w
Gradient descent

Let F : R^d → R be differentiable, (strictly) convex and such that

  ‖∇F(w) - ∇F(w')‖ ≤ L ‖w - w'‖    (e.g. sup_w ‖H(w)‖ ≤ L, with H the Hessian).

Then, starting from w_0 = 0, the iteration

  w_{t+1} = w_t - (1/L) ∇F(w_t)

converges to the minimizer of F.
Gradient descent for LR

  min_{w ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{-y_i w^T x_i}) + λ ‖w‖²

Consider

  w_{t+1} = w_t - (1/L) [ -(1/n) Σ_{i=1}^n x_i y_i / (1 + e^{y_i w_t^T x_i}) + 2λ w_t ]

Exercise: compute a good step-size!
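A minimal NumPy sketch of this iteration, assuming a data matrix X of shape (n, d) and labels y in {-1, +1}; the function name gd_logistic, the number of iterations T and the particular estimate of the smoothness constant L are illustrative choices, not part of the slides.

```python
import numpy as np

def gd_logistic(X, y, lam, T=500):
    """Gradient descent for Tikhonov-regularized logistic regression.

    X : (n, d) data matrix, y : (n,) labels in {-1, +1}, lam : regularization parameter.
    """
    n, d = X.shape
    # Smoothness estimate: ||X||^2/(4n) bounds the Hessian of the data term,
    # 2*lam comes from the penalty (a standard, slightly loose bound).
    L = np.linalg.norm(X, 2) ** 2 / (4 * n) + 2 * lam
    w = np.zeros(d)
    for _ in range(T):
        margins = y * (X @ w)                      # y_i w^T x_i
        grad = -(X.T @ (y / (1 + np.exp(margins)))) / n + 2 * lam * w
        w = w - grad / L                           # step-size 1/L
    return w
```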
Remarks

Complexity: O(ndT), with n the number of examples, d the dimensionality, T the number of steps.

What about non-linear functions?
Non-linear features

  f(x) = Σ_{i=1}^d w_i x_i   ⇝   f(x) = Σ_{i=1}^p w_i φ_i(x)

Model: Φ(x) = (φ_1(x), ..., φ_p(x))
Gradient descent for LR

Same up to the change x ⇝ Φ(x)...

  min_{w ∈ R^p} (1/n) Σ_{i=1}^n log(1 + e^{-y_i w^T Φ(x_i)}) + λ ‖w‖²

Consider

  w_{t+1} = w_t - (1/L) [ -(1/n) Σ_{i=1}^n Φ(x_i) y_i / (1 + e^{y_i w_t^T Φ(x_i)}) + 2λ w_t ]

Complexity is O(npT).
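As an illustration of the change x ⇝ Φ(x), one can build an explicit feature map and feed Φ(X) to the same iteration. The helper poly_features and the degree are illustrative choices (not the course's notation); the commented lines sketch the intended usage with the gd_logistic function above.

```python
import numpy as np
# Reuses gd_logistic from the sketch above: only the data representation changes.

def poly_features(X, degree=2):
    """Illustrative feature map Phi(x): coordinate-wise powers up to the given degree
    (a simple choice, not the full polynomial expansion)."""
    return np.hstack([X ** k for k in range(1, degree + 1)])

# X_train: (n, d) raw inputs, y_train: (n,) labels in {-1, +1}  (placeholders)
# w = gd_logistic(poly_features(X_train), y_train, lam=0.1)
# prediction on a new point x_new: np.sign(poly_features(x_new[None, :]) @ w)
```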
Representer theorem and kernels

Nonparametrics?

  f(x) = Σ_{i=1}^d w_i x_i   ⇝   f(x) = Σ_{i=1}^p w_i φ_i(x)

An equivalent formulation of gradient descent provides a way in...
Representer theorem for GD & LR

By induction,

  c_{t+1} = c_t - (1/L) [ -(1/n) Σ_{i=1}^n e_i y_i / (1 + e^{y_i f_t(x_i)}) + 2λ c_t ]

with e_i the i-th element of the canonical basis and

  f_t(x) = Σ_{i=1}^n x^T x_i (c_t)_i.

Show as an exercise!
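A quick numerical check of this equivalence, written against the gd_logistic sketch above; the function name gd_logistic_dual and the random data are illustrative, and the smoothness estimate matches the one used earlier.

```python
import numpy as np

def gd_logistic_dual(X, y, lam, T=500):
    """The same gradient descent, rewritten on the coefficients c with w_t = X^T c_t."""
    n = X.shape[0]
    K = X @ X.T                                    # linear kernel matrix, K_ij = x_i^T x_j
    L = np.linalg.norm(X, 2) ** 2 / (4 * n) + 2 * lam
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                  # f_t(x_i) = sum_j x_i^T x_j (c_t)_j
        grad_c = -(y / (1 + np.exp(y * f))) / n + 2 * lam * c
        c = c - grad_c / L
    return c

# Sanity check (illustrative): both parameterizations give the same solution.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = np.sign(rng.standard_normal(50))
c = gd_logistic_dual(X, y, lam=0.1)
# X.T @ c should match gd_logistic(X, y, lam=0.1) up to numerical error.
```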
Representer theorems

Idea: show that

  f(x) = w^T x = Σ_{i=1}^n x_i^T x c_i,    c_i ∈ R,

i.e. w = Σ_{i=1}^n x_i c_i.

Then:
- Replace the inner product x^T x' with K(x, x') = Φ(x)^T Φ(x')
- Compute c = (c_1, ..., c_n) ∈ R^n rather than w ∈ R^d
From features to kernels

  f(x) = Σ_{i=1}^n x_i^T x c_i   ⇝   f(x) = Σ_{i=1}^n Φ(x_i)^T Φ(x) c_i

Kernels: Φ(x)^T Φ(x') = K(x, x'), so that

  f(x) = Σ_{i=1}^n K(x_i, x) c_i
LR with kernels

GD-LR becomes

  c_{t+1} = c_t - (1/L) [ -(1/n) Σ_{i=1}^n e_i y_i / (1 + e^{y_i f_t(x_i)}) + 2λ c_t ]

where f_t(x) = Σ_{i=1}^n K(x, x_i) (c_t)_i.

What are (good) kernels?
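A kernelized version of the earlier sketch, assuming a precomputed kernel matrix on the training points; kernel_gd_logistic, the Gaussian-kernel helper and the smoothness estimate (largest eigenvalue of K in place of ‖X‖²) are illustrative choices.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=0.5):
    """K(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs of rows."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kernel_gd_logistic(K, y, lam, T=500):
    """Gradient descent for kernel logistic regression, on the coefficients c."""
    n = K.shape[0]
    L = np.linalg.eigvalsh(K).max() / (4 * n) + 2 * lam   # smoothness estimate in feature space
    c = np.zeros(n)
    for _ in range(T):
        f = K @ c                                          # f_t(x_i)
        c = c - (-(y / (1 + np.exp(y * f))) / n + 2 * lam * c) / L
    return c

# Usage (illustrative):
# c = kernel_gd_logistic(gaussian_kernel(Xtrain, Xtrain), ytrain, lam=0.1)
# prediction on new points: np.sign(gaussian_kernel(Xnew, Xtrain) @ c)
```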
Examples of kernels

- Linear:     K(x, x') = x^T x'
- Polynomial: K(x, x') = (1 + x^T x')^p, with p ∈ N
- Gaussian:   K(x, x') = e^{-γ ‖x - x'‖²}, with γ > 0

  f(x) = Σ_{i=1}^n c_i K(x_i, x)
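For completeness, sketches of the linear and polynomial kernels in the same vectorized style as the Gaussian one above; the function names and the default degree are illustrative.

```python
import numpy as np

def linear_kernel(X1, X2):
    """K(x, x') = x^T x' for all pairs of rows."""
    return X1 @ X2.T

def polynomial_kernel(X1, X2, p=3):
    """K(x, x') = (1 + x^T x')^p for all pairs of rows."""
    return (1 + X1 @ X2.T) ** p
```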
Kernel engineering

Kernels for strings, graphs, histograms, sets, ...
What is a kernel?

K(x, x') is:
- A similarity measure
- An inner product
- A positive definite function
Positive definite function

K : X × X → R is positive definite when, for any n ∈ N and x_1, ..., x_n ∈ X, the matrix K_n ∈ R^{n×n} with (K_n)_{ij} = K(x_i, x_j) is positive semidefinite (eigenvalues ≥ 0).
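A minimal numerical check of this property for a given kernel on a sample, using the gaussian_kernel sketch above; the helper name and the tolerance are illustrative choices.

```python
import numpy as np

def is_positive_semidefinite(K_n, tol=1e-10):
    """Check that a symmetric kernel matrix has no eigenvalue below -tol."""
    return np.linalg.eigvalsh(K_n).min() >= -tol

# Illustrative check on random points with the Gaussian kernel defined earlier:
# X = np.random.default_rng(0).standard_normal((20, 3))
# is_positive_semidefinite(gaussian_kernel(X, X))  # expected: True
```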
PD functions, RKHS and Gaussian processes

Each PD kernel defines:
- a function space called reproducing kernel Hilbert space (RKHS),

    H = span{K(·, x) | x ∈ X},

- a Gaussian process with covariance function K.
Nonparametrics and kernels

Number of parameters automatically determined by the number of points:

  f(x) = Σ_{i=1}^n K(x_i, x) c_i

Compare to

  f(x) = Σ_{j=1}^p φ_j(x) w_j
Hinge loss

[Plot: comparison of the 0-1 loss, square loss, hinge loss, and logistic loss]

  l(f(x), y) = |1 - y f(x)|_+

Linear and non-linear models; solution by the subgradient method.
Support vector machine (SVM)

  Ê_λ(w) = (1/n) Σ_{i=1}^n |1 - y_i w^T x_i|_+ + λ ‖w‖²

strongly convex but non-smooth!
Subgradient

Let F : R^p → R be convex. The subgradient ∂F(w_0) is the set of vectors v ∈ R^p such that, for every w ∈ R^p,

  F(w) - F(w_0) ≥ (w - w_0)^T v.

In one dimension, ∂F(w_0) = [F'_-(w_0), F'_+(w_0)].
Subgradient method

Let F : R^p → R be strictly convex with bounded subdifferential, and take γ_t = 1/√t. Then

  w_{t+1} = w_t - γ_t v_t,    with v_t ∈ ∂F(w_t),

converges to the minimizer of F.
Subgradient method for SVM

  min_{w ∈ R^p} (1/n) Σ_{i=1}^n |1 - y_i w^T x_i|_+ + λ ‖w‖²

Consider (using the left derivative)

  w_{t+1} = w_t - (1/√t) [ (1/n) Σ_{i=1}^n S_i(w_t) + 2λ w_t ]

  S_i(w) = -y_i x_i   if y_i w^T x_i ≤ 1,
  S_i(w) = 0          otherwise.
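A minimal NumPy sketch of this iteration, assuming labels in {-1, +1}; the function name, the step-size 1/√(t+1) (shifted to avoid dividing by zero at t = 0) and the iteration count are illustrative choices.

```python
import numpy as np

def subgradient_svm(X, y, lam, T=1000):
    """Subgradient method for the linear SVM with Tikhonov regularization."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(T):
        margins = y * (X @ w)
        active = margins <= 1                      # points where the hinge is active
        # subgradient of the objective: -(1/n) sum_{active} y_i x_i + 2*lam*w
        sub = -(X[active].T @ y[active]) / n + 2 * lam * w
        w = w - sub / np.sqrt(t + 1)               # step-size gamma_t
    return w

# Usage (illustrative): w = subgradient_svm(Xtrain, ytrain, lam=0.01)
# predictions: np.sign(Xtest @ w)
```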
Remarks

- The traditional solution is solving a quadratic program (QP)... but it doesn't scale (and is more complicated!).
- Easy extension to stochastic gradient.
- Can be seen as a robust, regularized perceptron.
Non-linear subgradient SVM

Same up to the change x ⇝ Φ(x):

  w_{t+1} = w_t - (1/√t) [ (1/n) Σ_{i=1}^n S_i(w_t) + 2λ w_t ]

  S_i(w) = -y_i Φ(x_i)   if y_i w^T Φ(x_i) ≤ 1,
  S_i(w) = 0             otherwise.
Representer theorem for SVM

By induction (exercise),

  c_{t+1} = c_t - (1/√t) [ (1/n) Σ_{i=1}^n S_i(c_t) + 2λ c_t ]

with

  f_t(x) = Σ_{i=1}^n x^T x_i (c_t)_i

and

  S_i(c_t) = -y_i e_i   if y_i f_t(x_i) < 1,
  S_i(c_t) = 0          otherwise,

where e_i is the i-th element of the canonical basis.
Kernel SVM by subgradient

Given a kernel K,

  c_{t+1} = c_t - (1/√t) [ (1/n) Σ_{i=1}^n S_i(c_t) + 2λ c_t ]

with

  f_t(x) = Σ_{i=1}^n K(x, x_i) (c_t)_i

and

  S_i(c_t) = -y_i e_i   if y_i f_t(x_i) < 1,
  S_i(c_t) = 0          otherwise,

where e_i is the i-th element of the canonical basis.
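A sketch of the kernel version, again assuming a precomputed kernel matrix and labels in {-1, +1}; the function name and step-size shift are illustrative, in the same style as the linear sketch above.

```python
import numpy as np

def kernel_subgradient_svm(K, y, lam, T=1000):
    """Subgradient method for the kernel SVM, on the coefficients c."""
    n = K.shape[0]
    c = np.zeros(n)
    for t in range(T):
        f = K @ c                                  # f_t(x_i) = sum_j K(x_i, x_j) c_j
        active = y * f < 1
        sub = np.where(active, -y, 0.0) / n + 2 * lam * c
        c = c - sub / np.sqrt(t + 1)
    return c

# Usage (illustrative), with the gaussian_kernel sketch above:
# c = kernel_subgradient_svm(gaussian_kernel(Xtrain, Xtrain), ytrain, lam=0.01)
# predictions: np.sign(gaussian_kernel(Xtest, Xtrain) @ c)
```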
Complexity

Without representer:
- Logistic: O(ndT)
- SVM: O(ndT)

With representer:
- Logistic: O(n²(d + T))
- SVM: O(n²(d + T))

n number of examples, d dimensionality, T number of steps.
But why are these called support vector machines???
Optimality condition for SVM

Smooth convex: ∇F(w*) = 0.   Non-smooth convex: 0 ∈ ∂F(w*).

  0 ∈ ∂[ (1/n) Σ_{i=1}^n |1 - y_i w^T x_i|_+ + λ ‖w‖² ](w*)
  ⇔  w* ∈ -(1/(2λ)) ∂[ (1/n) Σ_{i=1}^n |1 - y_i w^T x_i|_+ ](w*)
Optimality condition for SVM (cont.)

The optimality condition can be rewritten as

  0 = (1/n) Σ_{i=1}^n (-y_i x_i c_i) + 2λ w   ⇔   w = Σ_{i=1}^n x_i (y_i c_i / (2λn)),

where c_i = c_i(w) ∈ [V'_-(-y_i w^T x_i), V'_+(-y_i w^T x_i)].

A direct computation gives:

  c_i = 1        if y_i f(x_i) < 1
  0 ≤ c_i ≤ 1    if y_i f(x_i) = 1
  c_i = 0        if y_i f(x_i) > 1
Support vectors

  c_i = 1        if y_i f(x_i) < 1
  0 ≤ c_i ≤ 1    if y_i f(x_i) = 1
  c_i = 0        if y_i f(x_i) > 1

The points x_i with c_i ≠ 0, i.e. those on or inside the margin, are the support vectors.
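A small illustration of this characterization, reusing the kernel_subgradient_svm sketch above: the support vectors are the training points whose margin y_i f(x_i) does not exceed 1. The helper name and the tolerance are illustrative choices.

```python
import numpy as np

def support_vector_indices(K, y, c, tol=1e-6):
    """Indices of the (approximate) support vectors: points with margin y_i f(x_i) <= 1."""
    f = K @ c                      # f(x_i) on the training set
    return np.flatnonzero(y * f <= 1 + tol)

# Usage (illustrative):
# c = kernel_subgradient_svm(Ktrain, ytrain, lam=0.01)
# sv = support_vector_indices(Ktrain, ytrain, c)
```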
Are loss functions all the same?

  min_{w} Ê(w) + λ ‖w‖²

- Each loss has a different target function...
- ... and different computations.
- The choice of the loss is problem dependent.
This class

- Learning and regularization: logistic regression and SVM
- Optimization with first order methods
- Linear and non-linear parametric models
- Non-parametric models and kernels
Next class

Beyond penalization:
- Regularization by projection
- Early stopping