Kernel Learning via Random Fourier Representations


Kernel Learning via Random Fourier Representations
L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang
Module 5: Machine Learning

Outline
1 Background Theory: Recap of Kernels; Fourier features; Learning on distributions
2 Optimisation Methods
3 Experiments

Recap of Kernels

Data $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathcal{X}$ a non-empty set and $y_i \in \mathbb{R}$.

For instance, we want a function $f : \mathcal{X} \to \mathbb{R}$ to model the relationship between $x_i$ and $y_i$.

A kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$ for some Hilbert space $\mathcal{H}$ and feature map $\varphi : \mathcal{X} \to \mathcal{H}$.

Moore-Aronszajn Theorem: every positive semi-definite function is the reproducing kernel of some (unique) reproducing kernel Hilbert space.

So let's start with a positive semi-definite function, e.g. $k(x, y) = \exp\!\big(-\tfrac{1}{2}\|x - y\|^2\big)$.

Building the RKHS

We construct the RKHS within the space of functions $f : \mathcal{X} \to \mathbb{R}$.

Take functions of the form $k(\cdot, x)$ and their linear combinations:
$$f(\cdot) = \sum_{i=1}^{n} a_i\, k(\cdot, x_i).$$

Define the inner product, for $g(\cdot) = \sum_{j=1}^{m} b_j\, k(\cdot, y_j)$, as
$$\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j\, k(x_i, y_j).$$

Then $\langle k(\cdot, x), g \rangle_{\mathcal{H}_k} = g(x)$, and in particular $k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k}$.

Representer Theorem

Theorem. Let $L$ be a general loss function and $\Omega$ an increasing function. For the regularised risk
$$\operatorname*{argmin}_{f \in \mathcal{H}_k}\; L\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + \Omega\big(\|f\|_{\mathcal{H}_k}^{2}\big)$$
there exists a minimiser $f^{*} \in \mathcal{H}_k$ of the form
$$f^{*}(\cdot) = \sum_{i=1}^{n} a_i\, k(\cdot, x_i)$$
for some $(a_1, \ldots, a_n) \in \mathbb{R}^n$. If $\Omega$ is strictly increasing, then every minimiser of the regularised risk admits such a representation.
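As a concrete illustration: with squared loss and $\Omega(t) = \lambda n t$, the theorem specialises to kernel ridge regression, whose coefficients have the closed form $a = (K + \lambda n I)^{-1} y$. Below is a minimal NumPy sketch under these assumptions, using the Gaussian kernel from the previous slide; the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) evaluated on all pairs of rows
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    # Representer theorem: the minimiser is f(.) = sum_i a_i k(., x_i);
    # for (1/n) sum_i (y_i - f(x_i))^2 + lam ||f||^2 this gives a = (K + lam*n*I)^{-1} y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, a, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ a
```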

Kernel Methods

Kernel methods allow us to work in a high-dimensional, possibly infinite-dimensional, feature space without ever computing coordinates in it explicitly. However, this comes at a considerable computational cost, especially when the data set is large. This motivates the idea of Random Fourier features, which approximate the kernel through a lower-dimensional explicit feature map.

Random Fourier features

Theorem (Bochner's theorem). A continuous, translation-invariant kernel $k(x, y) = k(x - y)$ is the Fourier transform of a non-negative measure:
$$k(x - y) = \int_{\mathbb{R}^d} e^{i \omega^{T} (x - y)}\, p(\omega)\, d\omega = \mathbb{E}_{\omega \sim p}\big[\zeta_\omega(x)\, \zeta_\omega(y)^{*}\big],$$
where $\zeta_\omega(x) := e^{i \omega^{T} x}$.

Idea: sample frequencies $\omega_1, \ldots, \omega_m$ independently from $p$ and use the Monte Carlo estimator
$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^{m} \big[\cos(\omega_j^{T} x)\cos(\omega_j^{T} y) + \sin(\omega_j^{T} x)\sin(\omega_j^{T} y)\big].$$

Random Fourier features

$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^{m} \big[\cos(\omega_j^{T} x)\cos(\omega_j^{T} y) + \sin(\omega_j^{T} x)\sin(\omega_j^{T} y)\big]$$

In particular we now have an explicit feature map:
$$\varphi(x) := \frac{1}{\sqrt{m}}\big[\cos(\omega_1^{T} x), \sin(\omega_1^{T} x), \ldots, \cos(\omega_m^{T} x), \sin(\omega_m^{T} x)\big].$$

A kernel has to be pre-specified for the RFF, and choosing a good kernel is in general an open and challenging question.
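A minimal sketch of this construction for the Gaussian kernel $k(x, y) = \exp(-\tfrac{1}{2}\|x - y\|^2)$, whose spectral measure $p$ is the standard normal; the function and variable names below are illustrative.

```python
import numpy as np

def rff_features(X, omegas):
    # X: (n, d) inputs; omegas: (m, d) frequencies sampled from p.
    # Returns the 2m-dimensional feature map phi(x) above, one row per input.
    proj = X @ omegas.T                                   # omega_j^T x_i for all i, j
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
d, m = 5, 500
omegas = rng.standard_normal((m, d))                      # p = N(0, I) for the Gaussian kernel

x, y = rng.standard_normal(d), rng.standard_normal(d)
k_exact = np.exp(-0.5 * np.sum((x - y) ** 2))
k_approx = float(rff_features(x[None, :], omegas) @ rff_features(y[None, :], omegas).T)
# k_approx converges to k_exact as m grows (the Monte Carlo estimator above).
```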

Neural Network approach

Attempt to learn the kernel in a data-dependent fashion, using a one-hidden-layer neural network with this specific activation function:
$$\hat{y} = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \big[\beta_j^{\cos} \cos(\omega_j^{T} x) + \beta_j^{\sin} \sin(\omega_j^{T} x)\big] \qquad (1)$$
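A minimal sketch of the forward pass in (1); here $\Omega$ stacks the learnable frequencies $\omega_j$ as rows, and all names are illustrative.

```python
import numpy as np

def forward(x, Omega, beta_cos, beta_sin):
    # Omega: (m, d) learnable frequencies; beta_cos, beta_sin: (m,) output weights.
    # The hidden layer applies fixed cos/sin activations to the projections Omega @ x.
    proj = Omega @ x
    m = Omega.shape[0]
    return (beta_cos @ np.cos(proj) + beta_sin @ np.sin(proj)) / np.sqrt(m)
```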

Neural Network: picture of the one-hidden-layer architecture (figure).

Learning on Distributions

Now, for each label $y_i$, rather than a single observation we observe a bag of samples $x_{i,1}, \ldots, x_{i,N} \stackrel{iid}{\sim} \mathsf{x}_i$, where $\mathsf{x}_i$ is a distribution on $\mathbb{R}^d$. Let $\hat{\mathsf{x}}_i$ denote the empirical distribution associated with the $i$th bag.

Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel.

Use $k$ to embed $\hat{\mathsf{x}}_i$: $\hat{\mathsf{x}}_i \mapsto \mu_{\hat{\mathsf{x}}_i} = \frac{1}{N} \sum_{j=1}^{N} k(\cdot, x_{i,j}) \in \mathcal{H}_k$.

We can approximate $k$ using RFF as before, using frequencies $\omega_1, \ldots, \omega_m$.

We now consider another kernel $K$, initially defined on the space of mean-embedded distributions, restricted to the $\mu_{\hat{\mathsf{x}}_i}$.

Since we have approximated $k$ using RFF, our space is now finite-dimensional (parameterised by the frequencies), hence we can approximate $K$ using RFF also.
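A minimal sketch of the first-stage embedding under these assumptions: each bag is mapped to the average of its RFF features, a finite-dimensional surrogate for $\mu_{\hat{\mathsf{x}}_i}$, to which the second-stage kernel $K$ (and its own RFF) can then be applied. Names and data are illustrative.

```python
import numpy as np

def rff_features(X, omegas):
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

def embed_bag(bag, omegas):
    # bag: (N, d) samples from one distribution. Averaging the RFF features
    # approximates the mean embedding mu in a 2m-dimensional space.
    return rff_features(bag, omegas).mean(axis=0)

rng = np.random.default_rng(1)
omegas = rng.standard_normal((100, 3))                       # frequencies for the first-stage kernel k
bags = [rng.normal(loc=mu, size=(50, 3)) for mu in (0.0, 1.0, 2.0)]
embeddings = np.stack([embed_bag(b, omegas) for b in bags])  # one 200-dim vector per bag
```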

Optimisation problem

The problem becomes minimising the following objective.

Objective:
$$T = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \|\beta\|^2 + \mu \|\Omega\|^2 \qquad (2)$$

where the forward propagation is
$$\hat{y}_i = S\left(\frac{1}{\sqrt{m}} \sum_{j=1}^{m} \big[\beta_j^{\cos} \cos(w_j^{T} x_i) + \beta_j^{\sin} \sin(w_j^{T} x_i)\big]\right) \qquad (3)$$

and $S$ is an activation function. Back-propagation is the learning of the parameters $\beta$ and $\Omega$.

Gradient Descent

Given the forward propagation, update the parameters according to the gradient evaluated at the current parameters:
$$\beta \leftarrow \beta - \epsilon\, \nabla_{\beta} T \qquad (4)$$
$$\Omega \leftarrow \Omega - \epsilon\, \nabla_{\Omega} T \qquad (5)$$

where $\epsilon$ is a user-defined step size.
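A minimal sketch of one gradient-descent step for the objective (2), assuming the identity activation $S$ so the gradients can be written in closed form; names and details are illustrative rather than the exact implementation used in the experiments.

```python
import numpy as np

def grads(X, y, W, b_cos, b_sin, lam, mu):
    # Gradients of T in (2) with identity activation S; W stacks the omega_j as rows.
    n, m = X.shape[0], W.shape[0]
    P = X @ W.T
    C, S_ = np.cos(P), np.sin(P)
    y_hat = (C @ b_cos + S_ @ b_sin) / np.sqrt(m)
    r = y_hat - y                                            # residuals
    g_bc = (2 / n) * (C.T @ r) / np.sqrt(m) + 2 * lam * b_cos
    g_bs = (2 / n) * (S_.T @ r) / np.sqrt(m) + 2 * lam * b_sin
    A = (-S_ * b_cos + C * b_sin) * r[:, None]               # chain rule through cos/sin
    g_W = (2 / n) * (A.T @ X) / np.sqrt(m) + 2 * mu * W
    return y_hat, g_bc, g_bs, g_W

def gradient_descent_step(X, y, W, b_cos, b_sin, lam, mu, eps):
    # Updates (4)-(5): move each parameter against its gradient with step size eps.
    _, g_bc, g_bs, g_W = grads(X, y, W, b_cos, b_sin, lam, mu)
    return W - eps * g_W, b_cos - eps * g_bc, b_sin - eps * g_bs
```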

Stochastic Gradient Descent

Similar to gradient descent, but only one data point is used per iteration to evaluate the gradient. Let
$$T_i = (y_i - \hat{y}_i)^2 + \lambda \|\beta\|^2 + \mu \|\Omega\|^2 \qquad (6)$$

Stochastic Gradient Descent: on the $i$th iteration,
$$\beta \leftarrow \beta - \epsilon\, \nabla_{\beta} T_{\operatorname{mod}(i-1,\,n)+1} \qquad (7)$$
$$\Omega \leftarrow \Omega - \epsilon\, \nabla_{\Omega} T_{\operatorname{mod}(i-1,\,n)+1}. \qquad (8)$$
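Under the same assumptions, and reusing the grads function from the gradient-descent sketch above, the per-point update with the cyclic index $\operatorname{mod}(i-1, n) + 1$ might look as follows.

```python
def sgd(X, y, W, b_cos, b_sin, lam, mu, eps, n_iters):
    # Cycles through the data: iteration i uses the point mod(i-1, n) + 1,
    # i.e. index (i - 1) % n in 0-based indexing.
    n = X.shape[0]
    for i in range(1, n_iters + 1):
        j = (i - 1) % n
        _, g_bc, g_bs, g_W = grads(X[j:j + 1], y[j:j + 1], W, b_cos, b_sin, lam, mu)
        W, b_cos, b_sin = W - eps * g_W, b_cos - eps * g_bc, b_sin - eps * g_bs
    return W, b_cos, b_sin
```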

Quasi-Newton

Similar to gradient descent, but the step size varies according to the rate of change of the gradient.

Quasi-Newton:
$$\theta \leftarrow \theta - \big[\nabla_{\theta} \nabla_{\theta}^{T} T\big]^{-1} \nabla_{\theta} T \qquad (9)$$

The Hessian $\nabla_{\theta} \nabla_{\theta}^{T} T$ is approximated.
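One standard quasi-Newton scheme is L-BFGS, which builds the Hessian approximation from successive gradient differences. A sketch using SciPy, flattening $(\Omega, \beta)$ into a single vector $\theta$ and reusing the grads function above; this is illustrative rather than the exact setup used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def fit_quasi_newton(X, y, m, lam, mu, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta0 = np.concatenate([rng.standard_normal(m * d), np.zeros(2 * m)])

    def unpack(theta):
        W = theta[:m * d].reshape(m, d)
        return W, theta[m * d:m * d + m], theta[m * d + m:]

    def objective(theta):
        # Returns (T, dT/dtheta); L-BFGS approximates the Hessian from these gradients.
        W, b_cos, b_sin = unpack(theta)
        y_hat, g_bc, g_bs, g_W = grads(X, y, W, b_cos, b_sin, lam, mu)
        T = np.mean((y - y_hat) ** 2) + lam * (b_cos @ b_cos + b_sin @ b_sin) + mu * np.sum(W ** 2)
        return T, np.concatenate([g_W.ravel(), g_bc, g_bs])

    res = minimize(objective, theta0, jac=True, method="L-BFGS-B")
    return unpack(res.x)
```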

Timings: time taken to optimise (s) against $m \in \{2, 4, 8, 16, 32, 64\}$, for gradient descent, stochastic gradient descent and quasi-Newton (figure).

Training Error against $m \in \{2, 4, 8, 16, 32, 64\}$, for gradient descent, stochastic gradient descent and quasi-Newton (figure).

Test Error against $m \in \{2, 4, 8, 16, 32, 64\}$, for gradient descent, stochastic gradient descent and quasi-Newton (figure).

Classification Experiment: Adult Dataset [3]

$D = \{x_i, y_i\}_{i=1}^{N}$, $N = 48842$, $x_i \in \mathbb{R}^{108}$, $y_i \in \{0, 1\}$.

A) Two-layer neural network: quadratic loss, $\lambda = 1.5$, $\epsilon = 0.1$, $M = 100$. Result: 15.5% misclassification.

B) Random Fourier features + ridge regression: $\lambda = 0.001$, $\sigma^2 = 0.005$, $M = 1000$. Result: 15.4% misclassification.

Parameter selection: cross-validation? experimentation, overfitting.
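A minimal sketch of a pipeline in the spirit of variant (B), using scikit-learn's RBFSampler to draw the random Fourier features and a ridge classifier on top; the hyperparameters here are placeholders, not the values reported above.

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# RBFSampler approximates the Gaussian kernel with n_components random cosine features;
# RidgeClassifier then fits a regularised linear model in that feature space.
model = make_pipeline(
    RBFSampler(gamma=0.5, n_components=1000, random_state=0),
    RidgeClassifier(alpha=1e-3),
)
# model.fit(X_train, y_train); error = 1 - model.score(X_test, y_test)
```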


Comparison of frequencies (figure).

PCA: projection onto the first two principal components (PC2 explains 3.4% of variance), with loadings for V1-V108 (figure).

Loss (figure).

Regression Experiment: Aerosol Dataset [4]

Bags $B_k = \{(x_{k,j}, y_k)\}_{j=1}^{B}$ with $B = 100$ samples per bag, $k \in \{1, \ldots, K = 800\}$, $x \in \mathbb{R}^{16}$, $y \in \mathbb{R}$.

A) Three-layer neural network with mean pooling: quadratic loss, $M_1 = 50$, $M_2 = 50$, $\lambda = 0.05$, $\epsilon = 0.1$. Result: 0.075 RMSE.

B) Random Fourier features with mean pooling + ridge regression: $\lambda = 0.05$, $M_1 = 50$, $M_2 = 50$, $\sigma_1^2 = \sigma_2^2 = 0.5$. Result: 0.09 RMSE.


Comparison of frequencies (figure).

PCA: two panels projecting onto the first two principal components (PC1/PC2 explaining 14.5%/11.6% and 5.4%/4.9% of variance, respectively) (figure).

Loss (figure).

Conclusions

Competitive methods: both give very similar and accurate results.

Neural Network: more computational cost; potential to outperform in predictive power.

RFF: fast; requires the kernel to be determined in advance; simpler structure and interpretability.

For Further Reading I

Hofmann, T., Schölkopf, B. and Smola, A. J. (2008) Kernel Methods in Machine Learning. Annals of Statistics, Vol. 36, No. 3, 1171-1220.

Gretton, A. (2015) Advanced Topics in Machine Learning, course notes, available at http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html

Rahimi, A. and Recht, B. (2007) Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems (NIPS).

Szabó, Z., Gretton, A., Póczos, B. and Sriperumbudur, B. K. (2015) Two-stage Sampled Learning Theory on Distributions. International Conference on Artificial Intelligence and Statistics (AISTATS).