IKA: Independent Kernel Approximator


1 IKA: Independent Kernel Approximator
Matteo Ronchetti, Università di Pisa
This work started as my graduation thesis in Como with Stefano Serra Capizzano. It is currently being developed with the help of Federico Poloni.

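2 Introduction
A kernel $K(x, y) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a bivariate function such that, given $x_1, x_2, \dots, x_n \in \mathbb{R}^d$, the Gramian matrix

$$G = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{pmatrix} \in \mathbb{R}^{n \times n}$$

is symmetric positive semidefinite.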
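As a small illustration of the definition above, the sketch below builds the Gramian of a Gaussian RBF kernel on random points and checks symmetry and positive semidefiniteness numerically. The kernel choice, the `gamma` value, the helper name `rbf_kernel` and the toy data are assumptions made for this example only.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """Pairwise Gaussian RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # n = 50 toy points in R^3
G = rbf_kernel(X, X)                # Gramian matrix, G[i, j] = K(x_i, x_j)

is_symmetric = np.allclose(G, G.T)
min_eigenvalue = np.linalg.eigvalsh(G).min()   # nonnegative up to round-off for a valid kernel
```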

3 Low rank kernel approximation
We want to find matrices $B \in \mathbb{R}^{n \times c}$ and $W \in \mathbb{R}^{c \times c}$ such that

$$G \approx B W B^T.$$

We add the following constraints:
- $B_{ij} = b_j(x_i)$: possibility of approximating the kernel on new datapoints.
- $\mathrm{Rank}(W) \le k$: possibility of transforming the approximation into the more efficient $G \approx B W B^T = C C^T$ where $C \in \mathbb{R}^{n \times k}$.

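4 Nyström method
A well known method for low rank kernel approximation is the Nyström method:

$$G = \begin{pmatrix} G_{11} & G_{12} \\ G_{12}^T & G_{22} \end{pmatrix} \approx \begin{pmatrix} G_{11} \\ G_{12}^T \end{pmatrix} G_{11}^{+} \begin{pmatrix} G_{11} & G_{12} \end{pmatrix},$$

where $^{+}$ denotes the pseudoinverse. This can be generalized (Drineas et al. 2005) as $G \approx B W B^T$ where

$$B = \begin{pmatrix} G_{11} \\ G_{12}^T \end{pmatrix}, \qquad W = \Big( \operatorname*{argmin}_{\mathrm{rk}(X) \le k} \| G_{11} - X \| \Big)^{+}.$$

Notice that Nyström forces the choice $b_i(x) = K(x_i, x)$.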

5 Nyström features
- The computation of $W$ is efficient.
- The approximation is optimal only on the $c \times c$ block $G_{11}$.
- The forced choice $b_i(x) = K(x_i, x)$ can make online approximation expensive.
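The Nyström construction and its exactness on the $c \times c$ block can be sketched as follows. This is a minimal example under stated assumptions: an RBF kernel, toy random data, the first `c` points used as landmarks, and no rank truncation (so $W = G_{11}^{+}$). The full Gramian is built only to measure the error.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))       # toy data, n = 200
c = 10                              # number of landmark points (assumed: the first c rows)

G = rbf_kernel(X, X)                # full Gramian, used here only for error measurement
B = G[:, :c]                        # B = [G_11; G_12^T], i.e. b_j(x) = K(x_j, x)
G11 = G[:c, :c]
W = np.linalg.pinv(G11)             # Nystrom without rank truncation: W = G_11^+
G_nystrom = B @ W @ B.T

err_block = np.linalg.norm(G11 - G_nystrom[:c, :c])   # Frobenius error on the G_11 block
err_total = np.linalg.norm(G - G_nystrom)             # Frobenius error on the whole matrix
```

In this sketch `err_block` vanishes up to round-off, while `err_total` does not, which is the behaviour the second bullet describes.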

6 Ideas behind IKA
- Freely choose the functions $\{b_1(x), b_2(x), \dots, b_c(x)\}$.
- Project the eigenfunctions of the kernel on $\mathrm{Span}(b_1(x), b_2(x), \dots, b_c(x))$.
- Computing the projected eigenfunctions reduces to solving a $c \times c$ generalized eigenvalue problem.
- Approximate the kernel using the leading projected eigenfunctions.

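7 The approximation scenario
Given the datapoints $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ it is possible to define an inner product between real valued functions $f, g : \mathbb{R}^d \to \mathbb{R}$:

$$\langle f, g \rangle = \frac{1}{n} \sum_{i=1}^{n} f(x_i) g(x_i).$$

The kernel $K$ defines a self-adjoint linear operator:

$$Kf(x) = \langle K(x, \cdot), f(\cdot) \rangle = \frac{1}{n} \sum_{i=1}^{n} K(x, x_i) f(x_i).$$

The eigenfunctions of $K$ satisfy the following properties:

$$K\phi_i(x) = \lambda_i \phi_i(x), \qquad \langle \phi_i, \phi_j \rangle = \delta_{ij}, \qquad \lambda_i \in \mathbb{R}^{+}.$$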
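Restricted to the datapoints, the operator above acts on a vector of function values as multiplication by $G/n$, so its eigenfunctions can be read off an ordinary symmetric eigendecomposition. The sketch below checks the three stated properties numerically. The RBF kernel, the toy data and the helper names are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def inner(f_vals, g_vals):
    """Empirical inner product <f, g> = (1/n) sum_i f(x_i) g(x_i), from sampled values."""
    return np.mean(f_vals * g_vals)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
n = X.shape[0]
G = rbf_kernel(X, X)

# On the datapoints: (K f)(x_j) = (1/n) sum_i K(x_j, x_i) f(x_i) = (G f / n)_j.
lam, U = np.linalg.eigh(G / n)        # eigenpairs of the symmetric matrix G / n
phi = np.sqrt(n) * U                  # rescale columns so that <phi_i, phi_j> = delta_ij

gram_of_phi = phi.T @ phi / n         # matrix of empirical inner products <phi_i, phi_j>
residual = (G @ phi) / n - phi * lam  # K phi_i - lambda_i phi_i, evaluated at the datapoints
```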

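8 Projected eigenfunctions
Assume that $B \in \mathbb{R}^{n \times c}$ (where $B_{ij} = b_j(x_i)$) has full column rank and define

$$P = \frac{B^T B}{n} \in \mathbb{R}^{c \times c}, \qquad M = \frac{B^T G B}{n^2} \in \mathbb{R}^{c \times c}.$$

The projected eigenfunctions of $K$ can be found by solving the generalized eigenvalue problem:

$$M \Phi = P \Phi \Lambda$$

where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_c)$, $\Phi^T P \Phi = I$ and

$$f_i(x) = \sum_{j=1}^{c} \Phi_{ji} \, b_j(x)$$

is the $i$-th projected eigenfunction.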

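9 Kernel approximation
The kernel is approximated using the first $k$ projected eigenfunctions:

$$K(x, y) \approx \sum_{i=1}^{k} \lambda_i f_i(x) f_i(y).$$

In matrix form this corresponds to

$$G \approx B \Phi \Lambda_k \Phi^T B^T = B W B^T$$

where $\Lambda_k = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k, 0, \dots, 0)$.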
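The two steps above (generalized eigenvalue problem, then truncation to the leading $k$ eigenpairs) can be sketched in NumPy as follows. This is an illustrative sketch, not the author's implementation: the basis functions are assumed to be RBF bumps centred on the first `c` datapoints, the data are random, and the generalized problem is reduced to an ordinary symmetric one through a Cholesky factor of $P$. The helper names `generalized_eigh` and `ika` are made up for this example.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def generalized_eigh(M, P):
    """Solve M Phi = P Phi Lambda with Phi^T P Phi = I; eigenvalues in descending order."""
    L = np.linalg.cholesky(P)                         # P = L L^T (needs full column rank B)
    A = np.linalg.solve(L, np.linalg.solve(L, M).T)   # A = L^{-1} M L^{-T}
    lam, Y = np.linalg.eigh((A + A.T) / 2.0)          # ordinary problem A Y = Y Lambda
    order = np.argsort(lam)[::-1]
    return lam[order], np.linalg.solve(L.T, Y[:, order])   # Phi = L^{-T} Y

def ika(B, G, k):
    """Return W = Phi Lambda_k Phi^T, features C with C C^T = B W B^T, and the eigenpairs."""
    n = B.shape[0]
    P = B.T @ B / n
    M = B.T @ G @ B / n**2
    lam, Phi = generalized_eigh(M, P)
    W = (Phi[:, :k] * lam[:k]) @ Phi[:, :k].T
    C = (B @ Phi[:, :k]) * np.sqrt(np.maximum(lam[:k], 0.0))
    return W, C, lam, Phi

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))       # toy data
Z = X[:8]                           # assumed basis: b_j(x) = K(z_j, x), c = 8
B = rbf_kernel(X, Z)
G = rbf_kernel(X, X)

W, C, lam, Phi = ika(B, G, k=5)
err_k = np.linalg.norm(G - B @ W @ B.T)            # rank-5 approximation error
W_full = ika(B, G, k=8)[0]
err_full = np.linalg.norm(G - B @ W_full @ B.T)    # error with all c = 8 eigenpairs
```

The matrix `C` is the feature form $G \approx C C^T$: each new point only requires evaluating the `c` basis functions and one small matrix product.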

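10 Optimality
The approximation computed by IKA ($W = \Phi \Lambda_k \Phi^T$) is optimal on the whole matrix $G$ with respect to Frobenius and spectral norm:

$$W = \Phi \Lambda_k \Phi^T = \operatorname*{argmin}_{\mathrm{rk}(X) \le k} \| G - B X B^T \|_F^2$$

$$W = \Phi \Lambda_k \Phi^T = \operatorname*{argmin}_{\mathrm{rk}(X) \le k} \| G - B X B^T \|_2^2$$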
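For the unconstrained case $k = c$ and the Frobenius norm, the claim can be checked by hand. The short derivation below is a sketch covering that case only, assuming $B$ has full column rank so that $P$ is invertible. It does not cover the rank-constrained or spectral-norm statements.

```latex
% Full-rank case k = c, with P = B^T B / n and M = B^T G B / n^2.
\begin{align*}
\Phi^T P \Phi = I \;&\Rightarrow\; \Phi^{-1} = \Phi^T P, \\
M \Phi = P \Phi \Lambda \;&\Rightarrow\; M = P \Phi \Lambda \Phi^{-1} = P \Phi \Lambda \Phi^T P, \\
\Phi \Lambda \Phi^T &= P^{-1} M P^{-1}
  = (B^T B)^{-1} B^T G B \, (B^T B)^{-1}
  = B^{+} G \, (B^{+})^T .
\end{align*}
% B^+ G (B^+)^T is the least-squares solution of min_X || G - B X B^T ||_F.
```

The factors of $n$ cancel, and the last expression is the least-squares minimizer of $\| G - B X B^T \|_F$ over all $c \times c$ matrices $X$.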

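11 Subsampling of G
It is possible to use $m \le n$ points to compute the approximation by setting:

$$G = \begin{pmatrix} G_{11} & G_{12} \\ G_{12}^T & G_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}, \qquad P = \frac{B_1^T B_1}{m}, \qquad M = \frac{B_1^T G_{11} B_1}{m^2},$$

where $G_{11} \in \mathbb{R}^{m \times m}$ and $B_1 \in \mathbb{R}^{m \times c}$. The approximation will be optimal on $G_{11}$; the error on the other blocks will depend on the choice of $B$. Because of the optimality on $G_{11}$ it should be possible to study this error using existing results on approximate CUR matrix factorization.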

12 Comparison with Nyström method
IKA computes the same approximation as Nyström when setting $b_i(x) = K(x_i, x)$ and $m = c$. The advantages of IKA are:
- Possibility of choosing $b_i(x) \ne K(x_i, x)$ so that online approximation can be efficient even if the evaluation of $K(x, y)$ is not.
- Possibility of setting $m > c$: "see" a bigger part of $G$ to produce a better approximation.
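Both the subsampled variant and the equivalence with Nyström at $m = c$ can be checked numerically. The sketch below is restricted to the full-rank case $k = c$, where $W = \Phi \Lambda \Phi^T$ equals $P^{-1} M P^{-1}$ (this follows from $M\Phi = P\Phi\Lambda$ and $\Phi^T P \Phi = I$), so no eigensolver is needed. The RBF kernel, the random data, the landmark choice and the helper name `ika_weights` are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Pairwise Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def ika_weights(B1, G11):
    """Full-rank (k = c) IKA weights from m sampled points: W = P^{-1} M P^{-1}."""
    m = B1.shape[0]
    P = B1.T @ B1 / m
    M = B1.T @ G11 @ B1 / m**2
    return np.linalg.solve(P, np.linalg.solve(P, M).T).T

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))       # toy data, n = 300
c = 8
Z = X[:c]                           # Nystrom-style basis: b_j(x) = K(x_j, x)
G = rbf_kernel(X, X)                # full Gramian, used only to measure errors
B = rbf_kernel(X, Z)

# Frobenius error on the whole matrix when W is computed from the first m points.
errors = {}
for m in (c, 50, 300):
    W = ika_weights(B[:m], G[:m, :m])
    errors[m] = np.linalg.norm(G - B @ W @ B.T)

# With m = c and this basis, B_1 = G_11 and IKA reduces to Nystrom.
W_ika_c = ika_weights(B[:c], G[:c, :c])
W_nystrom = np.linalg.pinv(G[:c, :c])
```

Using all $n$ points gives the least-squares optimum for this basis, so `errors[300]` cannot exceed the other two entries. How much of that gain an intermediate `m` recovers depends on the data and on the basis.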

13 Numerical Results
We compared the Frobenius norm error committed by IKA and Nyström when approximating an RBF kernel on 5,000 unseen points. Both methods use the same $c = 128$ functions. Let $E_N$, $E_I$, $E_O$ be the errors committed respectively by Nyström, IKA and the optimal approximation of the kernel on test data. We measure the improvement of IKA using two quantities:

$$\Delta_a = 1 - \frac{E_I}{E_N}, \qquad \Delta_r = 1 - \frac{E_I - E_O}{E_N - E_O}.$$
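The two improvement measures are simple ratios. A minimal sketch follows, with a hypothetical helper name and made-up error values that only exercise the formulas (they are not measured results).

```python
def improvement(e_nystrom, e_ika, e_opt):
    """Return (delta_a, delta_r) from the Nystrom, IKA and optimal errors."""
    delta_a = 1.0 - e_ika / e_nystrom                         # absolute improvement over Nystrom
    delta_r = 1.0 - (e_ika - e_opt) / (e_nystrom - e_opt)     # share of the gap to optimal closed
    return delta_a, delta_r

# Hypothetical error values, for illustration only.
delta_a, delta_r = improvement(e_nystrom=10.0, e_ika=6.0, e_opt=2.0)
```

By construction $\Delta_r$ is 0 when IKA matches Nyström and 1 when it matches the optimal approximation, which is consistent with the 0.0% and 100.0% entries reported for those two methods.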

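14 Numerical Results
Method  | m | Error | Δ_a | Δ_r    | Time
Optimal |   |       |     | 100.0% | --
Nyström |   |       |     | 0.0%   | 9 ms
IKA     |   |       |     | 11.8%  | 17 ms
IKA     |   |       |     | 36.3%  | 31 ms
IKA     |   |       |     | 56.4%  | 80 ms
IKA     |   |       |     | 74.4%  | 248 ms
IKA     |   |       |     | 87.5%  | 802 ms
IKA     |   |       |     | 95.4%  | 3.2 s
IKA     |   |       |     | 99.2%  | 13.9 s
(The m, Error and Δ_a values did not survive extraction.)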

15 Numerical Results
The same comparison is repeated, fixing $m$ = [...] for IKA and varying the rank of the approximation.
Table columns: Nyström error, IKA error, $\Delta_a$ (the numeric entries did not survive extraction).
Note that Nyström requires an approximation rank [...] to match the performance of IKA at rank [...].

16 Future Work
- Choose the functions $b_i(x)$ such that the method is efficient when $G$ is sparse.
- Study the choice of functions $b_i(x)$ with respect to efficiency and approximation error.
- Generalize to approximate non-square matrices.
- Analyze performance on practical applications: kernel approximation, manifold learning, recommender systems, etc.

17 References
- Towards More Efficient SPSD Matrix Approximation and CUR Matrix Decomposition, Shusen Wang, Zhihua Zhang and Tong Zhang, Journal of Machine Learning Research (2016)
- On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning, Petros Drineas, Michael W. Mahoney, Journal of Machine Learning Research (2005)
- Spectral Clustering and Kernel PCA are Learning Eigenfunctions, Bengio, Vincent and Paiement, CIRANO (2003)
- Singular value (and eigenvalue) distribution and Krylov preconditioning of sequences of sampling matrices approximating integral operators, A.S. Al Fhaid, S. Serra Capizzano, D. Sesana and M. Zaka Ullah, Numerical Linear Algebra with Applications (2014)
