Convergence Rates of Kernel Quadrature Rules


Convergence Rates of Kernel Quadrature Rules
Francis Bach, INRIA - École Normale Supérieure, Paris, France
NIPS workshop on probabilistic integration, December 2015

Outline
- Introduction: quadrature rules; kernel quadrature rules (a.k.a. Bayes-Hermite quadrature)
- Generic analysis of kernel quadrature rules: eigenvalues of the covariance operator; optimal sampling distribution
- Extensions: link with random feature approximations; full function approximation; herding

Quadrature
Given a square-integrable function g : X → R and a probability measure dρ, approximate
  ∫_X h(x) g(x) dρ(x) ≈ Σ_{i=1}^n α_i h(x_i)
for all functions h : X → R in a certain function space F
Many applications
Main goal: choice of support points x_i ∈ X and weights α_i ∈ R; control of the error decay as n grows, uniformly over h

Quadrature - Existing rules
Generic baseline: Monte-Carlo
- x_i sampled from dρ(x), α_i = g(x_i)/n, with error O(1/√n) (see the numerical sketch below)
One-dimensional integrals, X = [0,1]
- Trapezoidal or Simpson's rules: O(1/n²) for f with uniformly bounded second derivatives (Cruz-Uribe and Neugebauer, 2002)
- Gaussian quadrature (based on orthogonal polynomials): exact on polynomials of degree 2n−1 (Hildebrand, 1987)
- Quasi-Monte-Carlo: O(1/n) for functions with bounded variation (Morokoff and Caflisch, 1994)
Multi-dimensional integrals, X = [0,1]^d
- All uni-dimensional methods above generalize, for small d
- Bayes-Hermite quadrature (O'Hagan, 1991)
- Kernel quadrature (Smola et al., 2007)
Extensions to less-standard sets X
- Only require a positive-definite kernel on X
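
The Monte-Carlo baseline above is easy to reproduce numerically. Below is a minimal sketch, assuming X = [0,1], dρ uniform, g ≡ 1, and a hypothetical test function h(x) = sin(2πx)²; these specific choices are not from the slides and only illustrate the O(1/√n) behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_quadrature(h, g, n):
    """Monte-Carlo rule: x_i sampled from drho (uniform on [0,1]), weights alpha_i = g(x_i)/n."""
    x = rng.uniform(0.0, 1.0, size=n)
    return np.mean(g(x) * h(x))

h = lambda x: np.sin(2 * np.pi * x) ** 2      # illustrative integrand, exact integral = 1/2
g = lambda x: np.ones_like(x)
for n in [10, 100, 1000, 10000]:
    print(n, abs(mc_quadrature(h, g, n) - 0.5))   # error decays roughly like 1/sqrt(n)
```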

Quadrature - Existing theoretical results
Key reference: Novak (1988)
Sobolev space on [0,1] (f^(s) square-integrable)
- Minimax error decay: O(n^{-s})
Sobolev space on [0,1]^d (all partial derivatives of total order s square-integrable)
- Minimax error decay: O(n^{-s/d})
Sobolev space on the hypersphere in d+1 dimensions
- Minimax error decay: O(n^{-2s/d})
A single result for all situations?

Kernels and reproducing kernel Hilbert spaces
Input space X; positive-definite kernel k : X × X → R
Reproducing kernel Hilbert space (RKHS) F
- Space of functions f : X → R spanned by Φ(x) = k(·, x), x ∈ X
Reproducing properties
- ⟨f, k(·, x)⟩_F = f(x) and ⟨k(·, y), k(·, x)⟩_F = k(x, y)
Example: Sobolev spaces, e.g., k(x, y) = exp(−|x − y|)

Quadrature in RKHS
Goal: given a distribution dρ on X and g ∈ L₂(dρ), estimate ∫_X h(x) g(x) dρ(x) by Σ_{j=1}^n α_j h(x_j) for any h ∈ F
Equivalently, estimate ∫_X ⟨k(·, x), h⟩_F g(x) dρ(x) by Σ_{j=1}^n α_j ⟨k(·, x_j), h⟩_F for any h ∈ F
Error = ⟨h, ∫_X k(·, x) g(x) dρ(x) − Σ_{j=1}^n α_j k(·, x_j)⟩_F ≤ ‖h‖_F · ‖∫_X k(·, x) g(x) dρ(x) − Σ_{j=1}^n α_j k(·, x_j)‖_F
Worst-case bound: sup_{‖h‖_F ≤ 1} |∫_X h(x) g(x) dρ(x) − Σ_{j=1}^n α_j h(x_j)| is equal to ‖∫_X k(·, x) g(x) dρ(x) − Σ_{j=1}^n α_j k(·, x_j)‖_F (Smola et al., 2007); a numerical sketch follows
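
Because the worst-case error is an RKHS norm, it can be evaluated exactly whenever the mean embedding µ(y) = ∫_X k(x, y) g(x) dρ(x) has a closed form. The sketch below uses the slides' example kernel k(x, y) = exp(−|x − y|), together with the illustrative (assumed, not from the slides) choices dρ uniform on [0,1] and g ≡ 1, for which µ and ∫∫ k dρ dρ = 2/e are available in closed form.

```python
import numpy as np

def k(x, y):
    """Example kernel from the slides: k(x, y) = exp(-|x - y|)."""
    return np.exp(-np.abs(x[:, None] - y[None, :]))

def mu(y):
    """Mean embedding mu(y) = int_0^1 exp(-|x - y|) dx (drho uniform on [0,1], g = 1)."""
    return 2.0 - np.exp(-y) - np.exp(-(1.0 - y))

def worst_case_error(x, alpha):
    """RKHS norm || sum_j alpha_j k(., x_j) - mu ||_F (Smola et al., 2007)."""
    double_int = 2.0 * np.exp(-1.0)                        # int int k(x, y) drho(x) drho(y)
    sq = double_int - 2.0 * alpha @ mu(x) + alpha @ k(x, x) @ alpha
    return np.sqrt(max(sq, 0.0))

x = np.linspace(0.0, 1.0, 20)          # 20 equispaced support points
alpha = np.full(20, 1.0 / 20)          # uniform weights
print(worst_case_error(x, alpha))
```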

Quadrature in RKHS
Goal: find x_1, ..., x_n such that ‖∫_X k(·, x) g(x) dρ(x) − Σ_{j=1}^n α_j k(·, x_j)‖_F is as small as possible
Computation of the weights α ∈ R^n given x_i ∈ X, i = 1, ..., n
- Need precise evaluations of µ(y) = ∫_X k(x, y) g(x) dρ(x) (Smola et al., 2007; Oates and Girolami, 2015)
- Minimize −2 Σ_{i=1}^n µ(x_i) α_i + Σ_{i,j=1}^n α_i α_j k(x_i, x_j) (see the sketch below)
Choice of support points x_i
- Optimization (Chen et al., 2010; Bach et al., 2012)
- Sampling
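
For fixed support points, the quadratic objective above is minimized by solving the linear system K α = µ(x), where K_{ij} = k(x_i, x_j). A minimal sketch in the same illustrative setting as before (kernel exp(−|x − y|), dρ uniform on [0,1], g ≡ 1; these choices are assumptions, not part of the slides):

```python
import numpy as np

def k(x, y):
    return np.exp(-np.abs(x[:, None] - y[None, :]))         # example kernel from the slides

def mu(y):
    return 2.0 - np.exp(-y) - np.exp(-(1.0 - y))            # closed-form mean embedding

x = np.sort(np.random.default_rng(0).uniform(0.0, 1.0, size=30))   # support points
K = k(x, x)

# Minimizing -2 sum_i mu(x_i) alpha_i + sum_{ij} alpha_i alpha_j k(x_i, x_j)
# amounts to solving K alpha = mu(x); a tiny ridge is added for numerical stability.
alpha = np.linalg.solve(K + 1e-10 * np.eye(len(x)), mu(x))

sq_error = 2.0 * np.exp(-1.0) - 2.0 * alpha @ mu(x) + alpha @ K @ alpha
print(np.sqrt(max(sq_error, 0.0)))                           # worst-case quadrature error
```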

Outline
- Introduction: quadrature rules; kernel quadrature rules (a.k.a. Bayes-Hermite quadrature)
- Generic analysis of kernel quadrature rules: eigenvalues of the covariance operator; optimal sampling distribution
- Extensions: link with random feature approximations; full function approximation; herding

Generic analysis of kernel quadrature rules
Support points x_i sampled i.i.d. from a density q w.r.t. dρ
Importance-weighted quadrature and error bound:
  ‖ Σ_{i=1}^n β_i q(x_i)^{-1/2} k(·, x_i) − ∫_X k(·, x) g(x) dρ(x) ‖²_F
Robustness to noise: ‖β‖₂² small
Approximation of the function µ = ∫_X k(·, x) g(x) dρ(x) by random elements of the RKHS
Classical tool: eigenvalues of the covariance operator
- Mercer decomposition (Mercer, 1909): k(x, y) = Σ_{m≥1} µ_m e_m(x) e_m(y) (see the numerical sketch below)
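
The Mercer eigenvalues (µ_m) are rarely known in closed form, but they can be estimated numerically: the eigenvalues of the kernel matrix on a large i.i.d. sample from dρ, divided by the sample size, approximate the eigenvalues of the covariance operator. A small sketch, again with the illustrative kernel exp(−|x − y|) and dρ uniform on [0,1] (assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000                                         # sample size from drho (uniform on [0,1])
x = rng.uniform(0.0, 1.0, size=N)
K = np.exp(-np.abs(x[:, None] - x[None, :]))

# Eigenvalues of K / N approximate the covariance-operator eigenvalues (mu_m);
# for this order-1 Sobolev-type kernel they decay roughly like m^(-2).
mu_hat = np.sort(np.linalg.eigvalsh(K) / N)[::-1]
print(mu_hat[:10])
```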

Upper bound
Assumptions
- x_1, ..., x_n ∈ X sampled i.i.d. with density q(x) ∝ Σ_{m≥1} [µ_m / (µ_m + λ)] e_m(x)²
- Degrees of freedom d(λ) = Σ_{m≥1} µ_m / (µ_m + λ)
Bound: for any δ > 0, if n ≥ 4 + 6 d(λ) log(4 d(λ)/δ), then with probability greater than 1 − δ,
  sup_{‖g‖_{L₂(dρ)} ≤ 1}  inf_{‖β‖₂² ≤ 4/n}  ‖ Σ_{i=1}^n β_i q(x_i)^{-1/2} k(·, x_i) − ∫_X k(·, x) g(x) dρ(x) ‖²_F  ≤  4λ
Proof technique (Bach, 2013; El Alaoui and Mahoney, 2014)
- Explicit β obtained by regularizing by λ ‖β‖₂²
- Concentration inequalities in Hilbert space

Upper bound
Under the same assumptions, the condition n ≥ 5 d(λ) log(16 d(λ)/δ) also gives the same bound ≤ 4λ with probability greater than 1 − δ
Matching lower bound (for any family of points x_i)
Key features: (a) degrees of freedom d(λ) and (b) sampling distribution q

Degrees of freedom
Degrees of freedom d(λ) = Σ_{m≥1} µ_m / (µ_m + λ)
Traditional quantity in the analysis of kernel methods (Hastie and Tibshirani, 1990)
If the eigenvalues decay as µ_m ∝ m^{-α}, α > 1, then d(λ) ≈ #{m : µ_m ≥ λ} ≈ λ^{-1/α} (sketch below)
Sobolev spaces in dimension d and order s: α = 2s/d
Take-home: need ≈ #{m : µ_m ≥ λ} features for squared error λ
Take-home: after n sampled points, squared error ≈ µ_n
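
The degrees of freedom are straightforward to compute once the eigenvalues are known (or estimated as in the earlier sketch). The sketch below checks the λ^{-1/α} growth for the polynomial decay µ_m = m^{-α}; the truncation at 10^6 eigenvalues is an arbitrary numerical choice.

```python
import numpy as np

def degrees_of_freedom(mu, lam):
    """d(lambda) = sum_m mu_m / (mu_m + lambda)."""
    return np.sum(mu / (mu + lam))

alpha = 2.0                                      # e.g. Sobolev order s = 1 in dimension d = 1
mu = np.arange(1, 10**6 + 1) ** (-alpha)         # eigenvalue decay mu_m = m^(-alpha)
for lam in [1e-2, 1e-3, 1e-4]:
    # d(lambda) grows like lambda^(-1/alpha), up to a constant
    print(lam, degrees_of_freedom(mu, lam), lam ** (-1.0 / alpha))
```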

Optimized sampling distribution
Density q(x) ∝ Σ_{m≥1} [µ_m / (µ_m + λ)] e_m(x)²
Relationship to leverage scores (Mahoney, 2011)
Hard to compute in generic situations; possible approximations (Drineas et al., 2012) (see the sketch below)
Sobolev spaces on [0,1]^d or the hypersphere
- Equal to the uniform distribution
- Matches known minimax rates
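
In practice the optimized density is usually approached through (ridge) leverage scores of a kernel matrix built on candidate points drawn from dρ, in the spirit of Drineas et al. (2012). The sketch below is one such approximation in the same illustrative setting as before (kernel exp(−|x − y|), dρ uniform on [0,1]); it is a rough proxy, not the exact density q.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 1000, 1e-3
x = rng.uniform(0.0, 1.0, size=N)                           # candidate points from drho
K = np.exp(-np.abs(x[:, None] - x[None, :]))

# Ridge leverage scores: diagonal of (K/N) (K/N + lam I)^{-1}.
# Up to normalization, they estimate the optimized density q at the candidate points.
scores = np.diag((K / N) @ np.linalg.inv(K / N + lam * np.eye(N)))
q_hat = scores / scores.sum()
print(q_hat.min(), q_hat.max())    # for this Sobolev-type kernel, roughly uniform (about 1/N)
```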

Outline
- Introduction: quadrature rules; kernel quadrature rules (a.k.a. Bayes-Hermite quadrature)
- Generic analysis of kernel quadrature rules: eigenvalues of the covariance operator; optimal sampling distribution
- Extensions: link with random feature approximations; full function approximation; herding

Link with random feature expansions
Some kernels are naturally expressed as expectations:
  k(x, y) = E_v[φ(v, x) φ(v, y)] ≈ (1/n) Σ_{i=1}^n φ(v_i, x) φ(v_i, y)
- Neural networks with infinitely many random hidden units (Neal, 1995)
- Random Fourier features (Rahimi and Recht, 2007) (see the sketch below)
Main question: minimal number n of features for a given approximation quality
Kernel quadrature is a subcase of random feature expansions:
- Mercer decomposition: k(x, y) = Σ_{m≥1} µ_m e_m(x) e_m(y)
- φ(v, x) = Σ_{m≥1} µ_m^{1/2} e_m(x) e_m(v), with v ∈ X sampled from dρ
- E_v[φ(v, x) φ(v, y)] = Σ_{m,m'≥1} (µ_m µ_{m'})^{1/2} e_m(x) e_{m'}(y) E[e_m(v) e_{m'}(v)] = k(x, y), since the e_m are orthonormal in L₂(dρ)
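
Random Fourier features are the most common instance of this expectation view. A minimal sketch for the Gaussian kernel (an assumed example, not taken from the slides), following Rahimi and Recht (2007):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))

def random_fourier_features(x, n_features, sigma=1.0):
    """phi(v, x) = sqrt(2) cos(w x + b) with w ~ N(0, 1/sigma^2), b ~ U[0, 2 pi];
    averaging phi(v_i, x) phi(v_i, y) over i approximates the Gaussian kernel."""
    w = rng.normal(0.0, 1.0 / sigma, size=n_features)
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x[:, None] * w[None, :] + b[None, :])

x = rng.uniform(-1.0, 1.0, size=50)
K = gaussian_kernel(x, x)
for n in [10, 100, 1000, 10000]:
    Phi = random_fourier_features(x, n)
    print(n, np.max(np.abs(Phi @ Phi.T - K)))   # error shrinks roughly as 1/sqrt(n)
```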

Full function approximation
Fact: given the support points x_i, the quadrature rule for ∫_X h(x) g(x) dρ(x) has weights that are linear in g, that is α_i = ⟨a_i, g⟩_{L₂(dρ)}, so that
  ∫_X h(x) g(x) dρ(x) − Σ_{i=1}^n α_i h(x_i) = ⟨g, h − Σ_{i=1}^n h(x_i) a_i⟩_{L₂(dρ)}
Uniform bound over ‖g‖_{L₂(dρ)} ≤ 1 ⇒ approximation of h in L₂(dρ)
- Recovers the result of Novak (1988)
- Approximation in the RKHS norm is not possible
- Approximation in the L∞-norm incurs a loss of performance of √n
Adaptivity to smoother functions
- If h is (a bit) smoother, then the rate is still optimal

[Figure: Sobolev spaces on [0,1]. Quadrature rule obtained from order t, applied to test functions of order s; each panel plots log10(squared error) against log10(n). Observed exponents for t = 1, 2, 3, 4: s = 1: 2.0, 2.5, 3.3, 5.2; s = 2: 3.9, 3.9, 4.0, 5.0; s = 3: 4.3, 5.8, 5.8, 5.7; s = 4: 4.2, 7.6, 7.5, 7.4.]

Herding
Choosing points through optimization (Chen et al., 2010)
Interpretation as Frank-Wolfe optimization (Bach, Lacoste-Julien, and Obozinski, 2012) (see the sketch below)
- Convex weights α
- Extra projection step (Briol et al., 2015)
Open problem: no true convergence rate (Bach et al., 2012)
- In finite dimension: exponential convergence rate, depending on the existence of a certain constant c > 0
- In infinite dimension: the constant c is provably equal to zero (an exponential rate would contradict the lower bounds)
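
For completeness, here is a sketch of the greedy (Frank-Wolfe) point selection with uniform weights 1/n, in the same illustrative setting as the earlier sketches (kernel exp(−|x − y|), dρ uniform on [0,1], g ≡ 1). Restricting the argmax to a fixed grid of candidates is an assumption made for simplicity, not part of the method as described in the references.

```python
import numpy as np

def k(x, y):
    return np.exp(-np.abs(np.asarray(x)[:, None] - np.asarray(y)[None, :]))

def mu(y):
    return 2.0 - np.exp(-y) - np.exp(-(1.0 - y))    # mean embedding, drho uniform, g = 1

def herding(n_points, n_candidates=2001):
    """Greedy Frank-Wolfe selection with uniform weights (Chen et al., 2010; Bach et al., 2012):
    x_{t+1} maximizes  mu(x) - (1/t) sum_{s <= t} k(x, x_s)  over the candidate grid."""
    grid = np.linspace(0.0, 1.0, n_candidates)
    points = []
    for _ in range(n_points):
        score = mu(grid)
        if points:
            score = score - k(grid, points).mean(axis=1)
        points.append(grid[np.argmax(score)])
    return np.array(points)

print(herding(10))
```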

Conclusion
Sharp analysis of kernel quadrature rules
- Spectrum (µ_m) of the covariance operator
- n points sampled i.i.d. from a well-chosen distribution
- Squared error of order µ_n
- Applies to any set X with a positive-definite kernel
Extensions
- Computationally efficient ways to sample from the optimized distribution (Drineas et al., 2012)
- Anytime sampling
- From quadrature to maximization (Novak, 1988)
- Quasi-random sampling (Yang et al., 2014; Oates and Girolami, 2015)

References

F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the International Conference on Learning Theory (COLT), 2013.
F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.
F.-X. Briol, C. J. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems (NIPS), 2015.
Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
D. Cruz-Uribe and C. J. Neugebauer. Sharp error bounds for the trapezoidal rule and Simpson's rule. Journal of Inequalities in Pure and Applied Mathematics, 3(4), 2002.
P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475-3506, 2012.
A. El Alaoui and M. W. Mahoney. Fast randomized kernel methods with statistical guarantees. Technical Report 1411.0306, arXiv, 2014.
T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
F. B. Hildebrand. Introduction to Numerical Analysis. Courier Dover Publications, 1987.
M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.
J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A, 209:415-446, 1909.
W. J. Morokoff and R. E. Caflisch. Quasi-random sequences and their discrepancies. SIAM Journal on Scientific Computing, 15(6):1251-1279, 1994.
R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
E. Novak. Deterministic and Stochastic Error Bounds in Numerical Analysis. Springer-Verlag, 1988.
C. J. Oates and M. Girolami. Variance reduction for quasi-Monte-Carlo. Technical Report 1501.03379, arXiv, 2015.
A. O'Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245-260, 1991.
A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 20:1177-1184, 2007.
A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13-31. Springer, 2007.
J. Yang, V. Sindhwani, H. Avron, and M. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of the International Conference on Machine Learning (ICML), 2014.