Computing regularization paths for learning multiple kernels


1 Computing regularization paths for learning multiple kernels
Francis Bach, Romain Thibaux, Michael Jordan
Computer Science, UC Berkeley
December 2004
Code available at

2 Computing regularization paths for learning multiple kernels
Kernel methods for supervised learning:
- Predict $y$ from $x$ as $f(x) = w^\top \Phi(x)$
- Learning from data $(x_i, y_i)$, $i = 1, \dots, n$
- Optimization problem (training error + regularization): minimize $\sum_i \ell(y_i, w^\top \Phi(x_i)) + \lambda \|w\|^2$
Two major issues:
- Choosing $\Phi(x)$, i.e., the kernel $k(x, y) = \Phi(x)^\top \Phi(y)$
- Choosing the regularization parameter $\lambda$

3 Learning multiple kernels and regularization paths
- Search over conic combinations $k(x, y) = \sum_j \eta_j k_j(x, y)$, $\eta_j \geq 0$
- Equivalent to using $\Phi(x) = (\Phi_1(x), \dots, \Phi_m(x))$, $w = (w_1, \dots, w_m)$ and a block 1-norm: minimize $\sum_i \ell(y_i, \sum_j w_j^\top \Phi_j(x_i)) + \lambda \sum_j \|w_j\|$
- Assume $\Phi_j(x)$, $j = 1, \dots, m$, known, and solve for all $\lambda$: compute the regularization path $w(\lambda)$, $\lambda \in \mathbb{R}_+$
Potential gains:
- Theoretical: understand block 1-norm regularization better
- Practical: get the entire path at the cost of one point
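The equivalence between a conic combination of kernels and concatenated feature maps can be checked numerically. The following is a minimal sketch, assuming two small explicit feature maps (`Phi1`, `Phi2`) and weights `eta` that are made up for illustration; for a fixed $\eta$, scaling each block of features by $\sqrt{\eta_j}$ reproduces the combined Gram matrix. The helper `block_one_norm` is a hypothetical name for the penalty $\sum_j \|w_j\|$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Two hypothetical explicit feature maps evaluated on n points (illustrative data).
Phi1 = rng.normal(size=(n, 3))
Phi2 = rng.normal(size=(n, 2))
K1, K2 = Phi1 @ Phi1.T, Phi2 @ Phi2.T

# Nonnegative weights defining a conic combination of the two kernels.
eta = np.array([0.7, 2.0])
K_conic = eta[0] * K1 + eta[1] * K2

# Same Gram matrix from concatenated features, each block scaled by sqrt(eta_j).
Phi = np.hstack([np.sqrt(eta[0]) * Phi1, np.sqrt(eta[1]) * Phi2])
K_concat = Phi @ Phi.T
gap = np.max(np.abs(K_conic - K_concat))


def block_one_norm(blocks):
    """Sum of the Euclidean norms of the blocks w_1, ..., w_m."""
    return sum(np.linalg.norm(w) for w in blocks)
```

Because the penalty sums unsquared norms, it can set whole blocks $w_j$ to zero, which is what makes it a kernel-selection device rather than a plain ridge penalty.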

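4 Classical kernel learning (2-norm regularization)
- Primal problem: $\min_w \sum_i \varphi_i(w^\top \Phi(x_i)) + \frac{\lambda}{2} \|w\|^2$
- Dual problem: $\max_{\alpha \in \mathbb{R}^n} \left( -\sum_i \psi_i(\lambda \alpha_i) - \frac{\lambda}{2} \alpha^\top K \alpha \right)$
- Optimality conditions: $\forall i,\ (K\alpha)_i + \psi_i'(\lambda \alpha_i) = 0$
Assumptions on the loss $\varphi_i$:
- $\varphi_i(u)$ strictly convex, twice differentiable
- $\psi_i(v)$ is the Fenchel conjugate of $\varphi_i(u)$, i.e., $\psi_i(v) = \max_{u \in \mathbb{R}} (vu - \varphi_i(u))$

                          $\varphi_i(u)$              $\psi_i(v)$
Least-squares regression  $\frac{1}{2}(y_i - u)^2$    $\frac{1}{2} v^2 + v y_i$
Logistic regression       $\log(1 + \exp(-y_i u))$    $(1 + v y_i)\log(1 + v y_i) - v y_i \log(-v y_i)$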
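For the least-squares loss, $\psi_i'(v) = v + y_i$, so the optimality conditions reduce to the linear system $(K + \lambda I)\alpha = -y$. The sketch below checks this on random data and compares primal and dual objective values. It assumes the sign convention $w = -\sum_i \alpha_i \Phi(x_i)$, which follows from the Fenchel-dual derivation but is not stated on the slide; the data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 8, 3, 0.5
Phi = rng.normal(size=(n, p))   # rows are feature vectors Phi(x_i)
y = rng.normal(size=n)
K = Phi @ Phi.T                 # Gram matrix

# Least squares: psi_i'(v) = v + y_i, so (K alpha)_i + lam * alpha_i + y_i = 0.
alpha = np.linalg.solve(K + lam * np.eye(n), -y)
residual = K @ alpha + (lam * alpha + y)

# Assumed primal-dual link: w = -sum_i alpha_i Phi(x_i).
w = -Phi.T @ alpha

primal = 0.5 * np.sum((y - Phi @ w) ** 2) + 0.5 * lam * (w @ w)
dual = -np.sum(0.5 * (lam * alpha) ** 2 + lam * alpha * y) - 0.5 * lam * (alpha @ K @ alpha)
```

Under these assumptions `w` coincides with the usual ridge-regression solution, and `primal` and `dual` agree up to rounding, which is a quick sanity check on the signs in the dual problem.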

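5 Block 1-norm regularization
$m$ feature spaces $F_j$ and feature maps $\Phi_j(x)$:
- Primal problem: $\min_{w \in F_1 \times \cdots \times F_m} \sum_i \varphi_i\big(\sum_j w_j^\top \Phi_j(x_i)\big) + \lambda \sum_j d_j \|w_j\|$
- Convex, non-differentiable: reformulation using conic constraints
- Dual problem: $\max_\alpha -\sum_i \psi_i(\lambda \alpha_i)$ such that $\forall j,\ \alpha^\top K_j \alpha \leq d_j^2$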

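6 Block 1-norm regularization
Optimality conditions:
- $\forall i,\ \big(\sum_j \eta_j K_j \alpha\big)_i + \psi_i'(\lambda \alpha_i) = 0$
- $\forall j,\ \alpha^\top K_j \alpha \leq d_j^2,\ \eta_j \geq 0,\ \eta_j (d_j^2 - \alpha^\top K_j \alpha) = 0$
The optimal $\alpha$ is the solution of the 2-norm problem with a conic combination of basis kernels: $K = \sum_j \eta_j K_j$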

7 Geometric interpretation
- Dual problem: $\max_\alpha -\sum_i \psi_i(\lambda \alpha_i)$ such that $\forall j,\ \alpha^\top K_j \alpha \leq d_j^2$
- "Target": $\beta_i = \arg\max_v\, -\psi_i(v)$

8 Active sets
- If $J = \{j,\ \eta_j > 0\}$ is known, the solution $(\alpha, \eta)$ is defined by
  $\forall i,\ \big(\sum_{j \in J} \eta_j K_j \alpha\big)_i + \psi_i'(\lambda \alpha_i) = 0$
  $\forall j \in J,\ \alpha^\top K_j \alpha = d_j^2$
- $n + |J|$ differentiable equations with $n + |J|$ unknowns: smooth path, easy to follow, but...
- Valid while $\eta_j \geq 0$, $j \in J$, and $\alpha^\top K_j \alpha \leq d_j^2$, $j \notin J$
- Change of active sets: piecewise smooth path, hard to follow
NB: with one kernel, the path is piecewise linear (Hastie et al., 2004)

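9 Log-barrier regularization
- Dual problem: $\max_\alpha -\sum_i \psi_i(\lambda \alpha_i)$ such that $\forall j,\ \alpha^\top K_j \alpha \leq d_j^2$
- Regularized dual problem: $\max_\alpha -\sum_i \psi_i(\lambda \alpha_i) + \mu \sum_j \log(d_j^2 - \alpha^\top K_j \alpha)$
Properties:
- Unconstrained concave maximization
- $\eta$ is a function of $\alpha$
- $\alpha$ is unique
- $\alpha(\lambda)$ is a differentiable function, easy to follow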
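One way to read "$\eta$ is a function of $\alpha$" is to set the gradient of the regularized dual to zero and match it against the optimality conditions of the block 1-norm problem. The derivation below is a sketch based only on the regularized dual as written on this slide; the constant factor in $\eta_j(\alpha)$ depends on how the barrier is normalized, so it should be treated as an assumption rather than the authors' exact scaling.

```latex
\begin{align*}
0 &= \frac{\partial}{\partial \alpha_i}
     \Big[ -\sum_k \psi_k(\lambda \alpha_k)
           + \mu \sum_j \log\big(d_j^2 - \alpha^\top K_j \alpha\big) \Big]
   = -\lambda\, \psi_i'(\lambda \alpha_i)
     - \sum_j \frac{2\mu}{d_j^2 - \alpha^\top K_j \alpha}\, (K_j \alpha)_i \\
&\Longleftrightarrow\quad
  \Big( \sum_j \eta_j(\alpha)\, K_j \alpha \Big)_i + \psi_i'(\lambda \alpha_i) = 0,
  \qquad
  \eta_j(\alpha) = \frac{2\mu}{\lambda\,\big(d_j^2 - \alpha^\top K_j \alpha\big)} .
\end{align*}
```

With this reading, every $\eta_j$ is strictly positive inside the feasible region, and the complementary slackness condition $\eta_j (d_j^2 - \alpha^\top K_j \alpha) = 0$ is relaxed to $\eta_j (d_j^2 - \alpha^\top K_j \alpha) = 2\mu/\lambda$.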

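10 Predictor-corrector method
Follow the solution of $F(\alpha, \lambda) = 0$:
- Predictor steps: first-order approximation using $\frac{d\alpha}{d\lambda} = -\left(\frac{\partial F}{\partial \alpha}\right)^{-1} \frac{\partial F}{\partial \lambda}$
- Corrector steps: Newton's method to converge back to the solution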
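The predictor and corrector steps can be sketched on a toy problem. The code below is a minimal illustration, not the authors' implementation: it assumes the least-squares loss (so $\psi_i'(v) = v + y_i$ and $\beta = -y$), takes $F(\alpha, \lambda) = \lambda\alpha + y + \sum_j \eta_j K_j \alpha$ with the barrier weight $\eta_j = 2\mu / (\lambda (d_j^2 - \alpha^\top K_j \alpha))$ obtained by differentiating the log-barrier dual, and uses a fixed geometric schedule for $\lambda$ instead of adaptive step sizes. The function names, the toy data, and the backtracking rule are all hypothetical choices.

```python
import numpy as np


def slacks(alpha, Ks, d):
    """Constraint slacks d_j^2 - alpha^T K_j alpha (must stay positive)."""
    return np.array([dj**2 - alpha @ K @ alpha for K, dj in zip(Ks, d)])


def system(alpha, lam, Ks, d, y, mu):
    """Return F, dF/dalpha and dF/dlambda at a strictly feasible alpha."""
    F = lam * alpha + y                      # psi_i'(lam * alpha_i) for least squares
    J = lam * np.eye(len(y))
    dF_dlam = alpha.copy()
    for K, dj in zip(Ks, d):
        Ka = K @ alpha
        s = dj**2 - alpha @ Ka
        eta = 2.0 * mu / (lam * s)           # assumed barrier weight
        F = F + eta * Ka
        J = J + eta * K + (2.0 * eta / s) * np.outer(Ka, Ka)
        dF_dlam = dF_dlam - (eta / lam) * Ka
    return F, J, dF_dlam


def corrector(alpha, lam, Ks, d, y, mu, tol=1e-9, max_iter=100):
    """Damped Newton iterations on F(alpha, lam) = 0 that stay feasible."""
    for _ in range(max_iter):
        F, J, _ = system(alpha, lam, Ks, d, y, mu)
        norm_F = np.linalg.norm(F)
        if norm_F < tol:
            break
        step = np.linalg.solve(J, -F)
        t = 1.0
        while t > 1e-12:                     # backtrack on feasibility and on ||F||
            cand = alpha + t * step
            if np.all(slacks(cand, Ks, d) > 0):
                F_new = system(cand, lam, Ks, d, y, mu)[0]
                if np.linalg.norm(F_new) <= (1.0 - 1e-4 * t) * norm_F:
                    alpha = cand
                    break
            t *= 0.5
    return alpha


def follow_path(Ks, d, y, mu, shrink=0.9, n_steps=30):
    """Follow alpha(lam) from a large lam downwards on a geometric grid."""
    beta = -y                                # maximizer of -psi_i for least squares
    lam = 2.0 * max(np.sqrt(beta @ K @ beta) / dj for K, dj in zip(Ks, d))
    alpha = corrector(beta / lam, lam, Ks, d, y, mu)
    path = [(lam, alpha)]
    for _ in range(n_steps):
        _, J, dF_dlam = system(alpha, lam, Ks, d, y, mu)
        dalpha_dlam = -np.linalg.solve(J, dF_dlam)   # predictor direction
        lam_new = shrink * lam
        t = 1.0
        pred = alpha + t * (lam_new - lam) * dalpha_dlam
        while np.any(slacks(pred, Ks, d) <= 0):      # shorten an infeasible predictor
            t *= 0.5
            pred = alpha + t * (lam_new - lam) * dalpha_dlam
        alpha = corrector(pred, lam_new, Ks, d, y, mu)
        lam = lam_new
        path.append((lam, alpha))
    return path


# Toy data: one rank-one linear kernel per input coordinate (illustrative only).
rng = np.random.default_rng(0)
n, m, mu = 20, 3, 1e-2
X = rng.normal(size=(n, m))
y = X[:, 0] + 0.1 * rng.normal(size=n)
Ks = [np.outer(X[:, j], X[:, j]) for j in range(m)]
d = np.ones(m)
path = follow_path(Ks, d, y, mu)
```

Each entry of `path` is a pair `(lam, alpha)`. The Jacobian is $\lambda I$ plus positive semidefinite terms, so both linear solves are well posed whenever the current point is strictly feasible.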

11 Predictor-corrector method: implementation issues
- Step-size selection for the predictor step $\delta\sigma$: adaptive selection
- Second-order approximation

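12 Initialization
- If $\left(\frac{\beta}{\lambda}\right)^\top K_j \left(\frac{\beta}{\lambda}\right) \leq d_j^2$ for all $j$, then $\alpha = \beta/\lambda$ is the solution
- Initialize using $\lambda = \max_j (\beta^\top K_j \beta / d_j^2)^{1/2}$ and $\alpha = \beta/\lambda$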
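The initialization rule is a one-liner once $\beta$ and the Gram matrices are available. The sketch below assumes the least-squares loss, for which $\beta = -y$, and uses random positive semidefinite matrices and made-up weights `d`; `initial_point` is a hypothetical helper name.

```python
import numpy as np


def initial_point(Ks, d, beta):
    """Smallest lambda for which beta / lambda satisfies every constraint."""
    lam0 = max(np.sqrt(beta @ K @ beta) / dj for K, dj in zip(Ks, d))
    return lam0, beta / lam0


rng = np.random.default_rng(1)
n, m = 6, 3
y = rng.normal(size=n)
Ks = []
for _ in range(m):
    A = rng.normal(size=(n, 2))
    Ks.append(A @ A.T)          # positive semidefinite Gram matrix
d = np.array([1.0, 2.0, 0.5])

beta = -y                       # assumption: least-squares loss
lam0, alpha0 = initial_point(Ks, d, beta)

# Ratio alpha^T K_j alpha / d_j^2: at lam0 the largest ratio equals 1.
ratios = [alpha0 @ K @ alpha0 / dj**2 for K, dj in zip(Ks, d)]
```

For any $\lambda$ larger than `lam0` all constraints are inactive, so the path is simply $\alpha = \beta/\lambda$ until the first kernel enters.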

13 Link with interior point methods
- Regularized dual problem: $\max_\alpha -\sum_i \psi_i(\lambda \alpha_i) + \mu \sum_j \log(d_j^2 - \alpha^\top K_j \alpha)$
- Interior point methods: $\lambda$ fixed, $\mu$ followed from large to small
- Regularization path: $\mu$ fixed small, $\lambda$ followed from large to small

14 Computational complexity
$n$: number of data points, $m$: number of kernels
- Interior point method to obtain one solution: $O(mn^3)$
- Path-following method:
  - Each predictor-corrector step: $O(n^3)$
  - Empirically $O(m)$ steps
  - Total complexity: $O(mn^3)$

15 Simulations
Setup for a given supervised learning problem:
- Build a large number of classical kernels
- Perform path following
- Compute performance on held-out validation data
Goals:
- Select the best regularization parameter
- Understand how regularization behaves

16 Simple example
[Figure: left: regression; right: classification; vertical axes: $\eta$]
- $\eta_j$ is not a monotonic function of $\lambda$
- Canonical behavior for extreme values of $\lambda$

17 Training/testing error
[Figure: mean square error and number of kernels]
Canonical behavior as $\lambda$ decreases:
- Training error decreases to zero
- Testing error decreases, increases, then stabilizes
Importance of $d_j$ (weight of penalization $= \sum_j d_j \|w_j\|$):
- $d_j$ should be an increasing function of the rank of $K_j$: $d_j = \left(\text{number of eigenvalues} \geq \frac{1}{2n}\right)^\gamma$
- $\gamma$ small: $d_j$ rank independent
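The weight $d_j$ can be computed from the spectrum of each Gram matrix. The sketch below assumes the threshold $1/(2n)$ applies to the eigenvalues of $K_j$ exactly as given, with no prior normalization, which the slide does not specify; `penalty_weight` is a hypothetical helper name.

```python
import numpy as np


def penalty_weight(K, gamma):
    """d_j = (number of eigenvalues of K that are >= 1/(2n)) ** gamma."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K)          # K is assumed symmetric
    count = int(np.sum(eigvals >= 1.0 / (2 * n)))
    return float(count) ** gamma


# A 4 x 4 Gram matrix with two eigenvalues above the threshold 1/8.
K = np.diag([3.0, 2.0, 0.0, 0.0])
d_rank = penalty_weight(K, gamma=1.0)        # grows with the numerical rank
d_flat = penalty_weight(K, gamma=0.0)        # independent of the rank
```

With $\gamma = 0$ every kernel receives the same weight, while $\gamma = 1$ penalizes a kernel in proportion to its numerical rank.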

18 Importance of $d_j$
[Figure: left: $\gamma = 0$, right: $\gamma = 1$; top: training (bold) / testing (dashed) error; bottom: number of kernels. Panels: regression (Boston dataset, mean square error) and classification (Liver dataset, error rate).]

19 Conclusion
Computing regularization paths for multiple kernels:
- Same complexity as solving for one point
- Theoretical understanding of regularization
- Practical implications
Future work:
- Theoretical complexity results
- Efficient implementation: from cubic to quadratic in $n$
