Kernel Learning via Random Fourier Representations


Kernel Learning via Random Fourier Representations
L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang
Module 5: Machine Learning

Outline
1 Background Theory: Recap of Kernels; Fourier features; Learning on distributions
2 Optimisation Methods
3 Experiments

Recap of Kernels

Data $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathcal{X}$ a non-empty set and $y_i \in \mathbb{R}$.

For instance, we want a function $f : \mathcal{X} \to \mathbb{R}$ to model the relationship between $x_i$ and $y_i$.

A kernel is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}}$ for some Hilbert space $\mathcal{H}$ and feature map $\varphi : \mathcal{X} \to \mathcal{H}$.

Moore-Aronszajn Theorem: every positive semi-definite function is the reproducing kernel of some (unique) reproducing kernel Hilbert space.

So let's start with a positive semi-definite function, e.g. $k(x, y) = \exp\!\big(-\tfrac{1}{2}\|x - y\|^2\big)$.

Building the RKHS

We construct the RKHS within the space of functions $f : \mathcal{X} \to \mathbb{R}$.

Take functions of the form $k(\cdot, x)$ and their linear combinations:
$$f(\cdot) = \sum_{i=1}^{n} a_i\, k(\cdot, x_i).$$

Define the inner product, for $g(\cdot) = \sum_{j=1}^{m} b_j\, k(\cdot, y_j)$, as
$$\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j\, k(x_i, y_j).$$

Then $\langle k(\cdot, x), g \rangle_{\mathcal{H}_k} = g(x)$, and in particular $k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k}$.

Representer Theorem

Theorem. Let $L$ be a general loss function and $\Omega$ an increasing function. For the regularised risk
$$\operatorname*{argmin}_{f \in \mathcal{H}_k}\; L\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + \Omega\big(\|f\|_{\mathcal{H}_k}^{2}\big)$$
there exists a minimiser $f^{*} \in \mathcal{H}_k$ of the form
$$f^{*}(\cdot) = \sum_{i=1}^{n} a_i\, k(\cdot, x_i)$$
for some $(a_1, \ldots, a_n) \in \mathbb{R}^n$. If $\Omega$ is strictly increasing, then every minimiser of the regularised risk admits such a representation.
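As a concrete illustration: with squared loss and $\Omega(t) = \lambda n t$, the theorem specialises to kernel ridge regression, whose coefficients have the closed form $a = (K + \lambda n I)^{-1} y$. Below is a minimal NumPy sketch under these assumptions, using the Gaussian kernel from the previous slide; the function names are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) evaluated on all pairs of rows
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam=1e-2, sigma=1.0):
    # Representer theorem: the minimiser is f(.) = sum_i a_i k(., x_i);
    # for (1/n) sum_i (y_i - f(x_i))^2 + lam ||f||^2 this gives a = (K + lam*n*I)^{-1} y.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, a, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ a
```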

Kernel Methods

Kernel methods allow us to work in a high-dimensional, possibly infinite-dimensional, feature space without ever computing coordinates in it explicitly. However, this comes at a considerable computational cost, especially when the data set is large. This motivates the idea of Random Fourier features, which approximate the kernel through a lower-dimensional explicit feature map.

Random Fourier features

Theorem (Bochner's theorem). A continuous, translation-invariant kernel $k(x, y) = k(x - y)$ is the Fourier transform of a non-negative measure:
$$k(x - y) = \int_{\mathbb{R}^d} e^{i \omega^{T} (x - y)}\, p(\omega)\, d\omega = \mathbb{E}_{\omega \sim p}\big[\zeta_\omega(x)\, \zeta_\omega(y)^{*}\big],$$
where $\zeta_\omega(x) := e^{i \omega^{T} x}$.

Idea: sample frequencies $\omega_1, \ldots, \omega_m$ independently from $p$ and use the Monte Carlo estimator
$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^{m} \big[\cos(\omega_j^{T} x)\cos(\omega_j^{T} y) + \sin(\omega_j^{T} x)\sin(\omega_j^{T} y)\big].$$

Random Fourier features

$$\hat{k}(x, y) = \frac{1}{m} \sum_{j=1}^{m} \big[\cos(\omega_j^{T} x)\cos(\omega_j^{T} y) + \sin(\omega_j^{T} x)\sin(\omega_j^{T} y)\big]$$

In particular we now have an explicit feature map:
$$\varphi(x) := \frac{1}{\sqrt{m}}\big[\cos(\omega_1^{T} x), \sin(\omega_1^{T} x), \ldots, \cos(\omega_m^{T} x), \sin(\omega_m^{T} x)\big].$$

A kernel has to be pre-specified for the RFF, and choosing a good kernel is in general an open and challenging question.
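A minimal sketch of this construction for the Gaussian kernel $k(x, y) = \exp(-\tfrac{1}{2}\|x - y\|^2)$, whose spectral measure $p$ is the standard normal; the function and variable names below are illustrative.

```python
import numpy as np

def rff_features(X, omegas):
    # X: (n, d) inputs; omegas: (m, d) frequencies sampled from p.
    # Returns the 2m-dimensional feature map phi(x) above, one row per input.
    proj = X @ omegas.T                                   # omega_j^T x_i for all i, j
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
d, m = 5, 500
omegas = rng.standard_normal((m, d))                      # p = N(0, I) for the Gaussian kernel

x, y = rng.standard_normal(d), rng.standard_normal(d)
k_exact = np.exp(-0.5 * np.sum((x - y) ** 2))
k_approx = float(rff_features(x[None, :], omegas) @ rff_features(y[None, :], omegas).T)
# k_approx converges to k_exact as m grows (the Monte Carlo estimator above).
```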

Neural Network approach

Attempt to learn the kernel in a data-dependent fashion, using a one-hidden-layer neural network with this specific activation function:
$$\hat{y} = \frac{1}{\sqrt{m}} \sum_{j=1}^{m} \big[\beta_j^{\cos} \cos(\omega_j^{T} x) + \beta_j^{\sin} \sin(\omega_j^{T} x)\big] \qquad (1)$$
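A minimal sketch of the forward pass in (1); here $\Omega$ stacks the learnable frequencies $\omega_j$ as rows, and all names are illustrative.

```python
import numpy as np

def forward(x, Omega, beta_cos, beta_sin):
    # Omega: (m, d) learnable frequencies; beta_cos, beta_sin: (m,) output weights.
    # The hidden layer applies fixed cos/sin activations to the projections Omega @ x.
    proj = Omega @ x
    m = Omega.shape[0]
    return (beta_cos @ np.cos(proj) + beta_sin @ np.sin(proj)) / np.sqrt(m)
```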

Neural Network: picture of the one-hidden-layer architecture (figure).

Learning on Distributions

Now, for each label $y_i$, rather than a single observation we observe a bag of samples $x_{i,1}, \ldots, x_{i,N} \stackrel{iid}{\sim} \mathsf{x}_i$, where $\mathsf{x}_i$ is a distribution on $\mathbb{R}^d$. Let $\hat{\mathsf{x}}_i$ denote the empirical distribution associated with the $i$th bag.

Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel.

Use $k$ to embed $\hat{\mathsf{x}}_i$: $\hat{\mathsf{x}}_i \mapsto \mu_{\hat{\mathsf{x}}_i} = \frac{1}{N} \sum_{j=1}^{N} k(\cdot, x_{i,j}) \in \mathcal{H}_k$.

We can approximate $k$ using RFF as before, using frequencies $\omega_1, \ldots, \omega_m$.

We now consider another kernel $K$, initially defined on the space of mean-embedded distributions, restricted to the $\mu_{\hat{\mathsf{x}}_i}$.

Since we have approximated $k$ using RFF, our space is now finite-dimensional (parameterised by the frequencies), hence we can approximate $K$ using RFF also.
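A minimal sketch of the first-stage embedding under these assumptions: each bag is mapped to the average of its RFF features, a finite-dimensional surrogate for $\mu_{\hat{\mathsf{x}}_i}$, to which the second-stage kernel $K$ (and its own RFF) can then be applied. Names and data are illustrative.

```python
import numpy as np

def rff_features(X, omegas):
    proj = X @ omegas.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

def embed_bag(bag, omegas):
    # bag: (N, d) samples from one distribution. Averaging the RFF features
    # approximates the mean embedding mu in a 2m-dimensional space.
    return rff_features(bag, omegas).mean(axis=0)

rng = np.random.default_rng(1)
omegas = rng.standard_normal((100, 3))                       # frequencies for the first-stage kernel k
bags = [rng.normal(loc=mu, size=(50, 3)) for mu in (0.0, 1.0, 2.0)]
embeddings = np.stack([embed_bag(b, omegas) for b in bags])  # one 200-dim vector per bag
```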

Optimisation problem

The problem becomes minimising the following objective.

Objective:
$$T = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \|\beta\|^2 + \mu \|\Omega\|^2 \qquad (2)$$

where the forward propagation is
$$\hat{y}_i = S\left(\frac{1}{\sqrt{m}} \sum_{j=1}^{m} \big[\beta_j^{\cos} \cos(w_j^{T} x_i) + \beta_j^{\sin} \sin(w_j^{T} x_i)\big]\right) \qquad (3)$$

and $S$ is an activation function. Back-propagation is the learning of the parameters $\beta$ and $\Omega$.

Gradient Descent

Given the forward propagation, update the parameters according to the gradient evaluated at the current parameters:
$$\beta \leftarrow \beta - \epsilon\, \nabla_{\beta} T \qquad (4)$$
$$\Omega \leftarrow \Omega - \epsilon\, \nabla_{\Omega} T \qquad (5)$$

where $\epsilon$ is a user-defined step size.
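A minimal sketch of one gradient-descent step for the objective (2), assuming the identity activation $S$ so the gradients can be written in closed form; names and details are illustrative rather than the exact implementation used in the experiments.

```python
import numpy as np

def grads(X, y, W, b_cos, b_sin, lam, mu):
    # Gradients of T in (2) with identity activation S; W stacks the omega_j as rows.
    n, m = X.shape[0], W.shape[0]
    P = X @ W.T
    C, S_ = np.cos(P), np.sin(P)
    y_hat = (C @ b_cos + S_ @ b_sin) / np.sqrt(m)
    r = y_hat - y                                            # residuals
    g_bc = (2 / n) * (C.T @ r) / np.sqrt(m) + 2 * lam * b_cos
    g_bs = (2 / n) * (S_.T @ r) / np.sqrt(m) + 2 * lam * b_sin
    A = (-S_ * b_cos + C * b_sin) * r[:, None]               # chain rule through cos/sin
    g_W = (2 / n) * (A.T @ X) / np.sqrt(m) + 2 * mu * W
    return y_hat, g_bc, g_bs, g_W

def gradient_descent_step(X, y, W, b_cos, b_sin, lam, mu, eps):
    # Updates (4)-(5): move each parameter against its gradient with step size eps.
    _, g_bc, g_bs, g_W = grads(X, y, W, b_cos, b_sin, lam, mu)
    return W - eps * g_W, b_cos - eps * g_bc, b_sin - eps * g_bs
```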

Stochastic Gradient Descent

Similar to gradient descent, but only one data point is used per iteration to evaluate the gradient. Let
$$T_i = (y_i - \hat{y}_i)^2 + \lambda \|\beta\|^2 + \mu \|\Omega\|^2 \qquad (6)$$

Stochastic Gradient Descent: on the $i$th iteration,
$$\beta \leftarrow \beta - \epsilon\, \nabla_{\beta} T_{\operatorname{mod}(i-1,\,n)+1} \qquad (7)$$
$$\Omega \leftarrow \Omega - \epsilon\, \nabla_{\Omega} T_{\operatorname{mod}(i-1,\,n)+1}. \qquad (8)$$
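Under the same assumptions, and reusing the grads function from the gradient-descent sketch above, the per-point update with the cyclic index $\operatorname{mod}(i-1, n) + 1$ might look as follows.

```python
def sgd(X, y, W, b_cos, b_sin, lam, mu, eps, n_iters):
    # Cycles through the data: iteration i uses the point mod(i-1, n) + 1,
    # i.e. index (i - 1) % n in 0-based indexing.
    n = X.shape[0]
    for i in range(1, n_iters + 1):
        j = (i - 1) % n
        _, g_bc, g_bs, g_W = grads(X[j:j + 1], y[j:j + 1], W, b_cos, b_sin, lam, mu)
        W, b_cos, b_sin = W - eps * g_W, b_cos - eps * g_bc, b_sin - eps * g_bs
    return W, b_cos, b_sin
```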

Quasi-Newton

Similar to gradient descent, but the step size varies according to the rate of change of the gradient.

Quasi-Newton:
$$\theta \leftarrow \theta - \big[\nabla_{\theta} \nabla_{\theta}^{T} T\big]^{-1} \nabla_{\theta} T \qquad (9)$$

The Hessian $\nabla_{\theta} \nabla_{\theta}^{T} T$ is approximated.
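One standard quasi-Newton scheme is L-BFGS, which builds the Hessian approximation from successive gradient differences. A sketch using SciPy, flattening $(\Omega, \beta)$ into a single vector $\theta$ and reusing the grads function above; this is illustrative rather than the exact setup used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def fit_quasi_newton(X, y, m, lam, mu, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    theta0 = np.concatenate([rng.standard_normal(m * d), np.zeros(2 * m)])

    def unpack(theta):
        W = theta[:m * d].reshape(m, d)
        return W, theta[m * d:m * d + m], theta[m * d + m:]

    def objective(theta):
        # Returns (T, dT/dtheta); L-BFGS approximates the Hessian from these gradients.
        W, b_cos, b_sin = unpack(theta)
        y_hat, g_bc, g_bs, g_W = grads(X, y, W, b_cos, b_sin, lam, mu)
        T = np.mean((y - y_hat) ** 2) + lam * (b_cos @ b_cos + b_sin @ b_sin) + mu * np.sum(W ** 2)
        return T, np.concatenate([g_W.ravel(), g_bc, g_bs])

    res = minimize(objective, theta0, jac=True, method="L-BFGS-B")
    return unpack(res.x)
```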

Timings: time taken to optimise (s) against $m \in \{2, 4, 8, 16, 32, 64\}$, for gradient descent, stochastic gradient descent and quasi-Newton (figure).

Training Error against $m \in \{2, 4, 8, 16, 32, 64\}$, for gradient descent, stochastic gradient descent and quasi-Newton (figure).

Test Error against $m \in \{2, 4, 8, 16, 32, 64\}$, for gradient descent, stochastic gradient descent and quasi-Newton (figure).

Classification Experiment: Adult Dataset [3]

$D = \{x_i, y_i\}_{i=1}^{N}$, $N = 48842$, $x_i \in \mathbb{R}^{108}$, $y_i \in \{0, 1\}$.

A) Two-layer neural network: quadratic loss, $\lambda = 1.5$, $\epsilon = 0.1$, $M = 100$. Result: 15.5% misclassification.

B) Random Fourier features + ridge regression: $\lambda = 0.001$, $\sigma^2 = 0.005$, $M = 1000$. Result: 15.4% misclassification.

Parameter selection: cross-validation? experimentation, overfitting.
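A minimal sketch of a pipeline in the spirit of variant (B), using scikit-learn's RBFSampler to draw the random Fourier features and a ridge classifier on top; the hyperparameters here are placeholders, not the values reported above.

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

# RBFSampler approximates the Gaussian kernel with n_components random cosine features;
# RidgeClassifier then fits a regularised linear model in that feature space.
model = make_pipeline(
    RBFSampler(gamma=0.5, n_components=1000, random_state=0),
    RidgeClassifier(alpha=1e-3),
)
# model.fit(X_train, y_train); error = 1 - model.score(X_test, y_test)
```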


Comparison of frequencies (figure).

PCA: projection onto the first two principal components (PC2 explains 3.4% of variance), with loadings for V1-V108 (figure).

Loss (figure).

Regression Experiment: Aerosol Dataset [4]

Bags $B_k = \{(x_{k,j}, y_k)\}_{j=1}^{B}$ with $B = 100$ samples per bag, $k \in \{1, \ldots, K = 800\}$, $x \in \mathbb{R}^{16}$, $y \in \mathbb{R}$.

A) Three-layer neural network with mean pooling: quadratic loss, $M_1 = 50$, $M_2 = 50$, $\lambda = 0.05$, $\epsilon = 0.1$. Result: 0.075 RMSE.

B) Random Fourier features with mean pooling + ridge regression: $\lambda = 0.05$, $M_1 = 50$, $M_2 = 50$, $\sigma_1^2 = \sigma_2^2 = 0.5$. Result: 0.09 RMSE.


Comparison of frequencies (figure).

PCA: two panels projecting onto the first two principal components (PC1/PC2 explaining 14.5%/11.6% and 5.4%/4.9% of variance, respectively) (figure).

Loss (figure).

Conclusions

Competitive methods: both give very similar and accurate results.

Neural Network: more computational cost; potential to outperform in predictive power.

RFF: fast; requires the kernel to be determined in advance; simpler structure and interpretability.

For Further Reading I

Hofmann, T., Schölkopf, B. and Smola, A. J. (2008) Kernel Methods in Machine Learning. Annals of Statistics, Vol. 36, No. 3, 1171-1220.

Gretton, A. (2015) Advanced Topics in Machine Learning, course notes, available at http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/rkhscourse.html

Rahimi, A. and Recht, B. (2007) Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems (NIPS).

Szabó, Z., Gretton, A., Póczos, B. and Sriperumbudur, B. K. (2015) Two-stage Sampled Learning Theory on Distributions. International Conference on Artificial Intelligence and Statistics (AISTATS).