Convergence Rates of Kernel Quadrature Rules
1 Convergence Rates of Kernel Quadrature Rules
Francis Bach, INRIA - École Normale Supérieure, Paris, France
NIPS workshop on probabilistic integration - Dec. 2015
2 Outline
- Introduction
  - Quadrature rules
  - Kernel quadrature rules (a.k.a. Bayes-Hermite quadrature)
- Generic analysis of kernel quadrature rules
  - Eigenvalues of the covariance operator
  - Optimal sampling distribution
- Extensions
  - Link with random feature approximations
  - Full function approximation
  - Herding
3 Quadrature
Given a square-integrable function g : X → R and a probability measure dρ, approximate
  ∫_X h(x) g(x) dρ(x) ≈ Σ_{i=1}^n α_i h(x_i)
for all functions h : X → R in a certain function space F.
- Many applications
- Main goal: choice of the support points x_i ∈ X and weights α_i ∈ R
- Control of the error decay as n grows, uniformly over h
4-7 Quadrature - Existing rules
- Generic baseline: Monte Carlo. x_i sampled i.i.d. from dρ(x), α_i = g(x_i)/n, with error O(1/√n)
- One-dimensional integrals, X = [0, 1]:
  - Trapezoidal and Simpson's rules: O(1/n²) for f with uniformly bounded second derivatives (Cruz-Uribe and Neugebauer, 2002)
  - Gaussian quadrature (based on orthogonal polynomials): exact on polynomials of degree 2n − 1 (Hildebrand, 1987)
  - Quasi-Monte Carlo: O(1/n) for functions with bounded variation (Morokoff and Caflisch, 1994)
- Multi-dimensional integrals, X = [0, 1]^d:
  - All uni-dimensional methods above generalize for small d
  - Bayes-Hermite quadrature (O'Hagan, 1991)
  - Kernel quadrature (Smola et al., 2007)
- Extensions to less-standard sets X: only require a positive-definite kernel on X
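The first two rates above can be checked numerically. A minimal sketch (my own example, not from the slides; the integrand exp(x) is an arbitrary smooth choice), comparing the Monte Carlo baseline with the composite trapezoidal rule on X = [0, 1]:

```python
import numpy as np

# Hedged sketch: compare the Monte Carlo baseline, error O(1/sqrt(n)),
# with the composite trapezoidal rule, error O(1/n^2), for a smooth
# integrand on X = [0, 1], with g = 1 and dρ uniform.
rng = np.random.default_rng(0)

def h(x):
    return np.exp(x)            # smooth test integrand (arbitrary choice)

exact = np.e - 1.0              # closed form of the integral of exp over [0, 1]

n = 1000
# Monte Carlo: x_i sampled from dρ, weights α_i = 1/n
mc_est = h(rng.uniform(size=n)).mean()
# Composite trapezoidal rule on the same budget of n evaluations
x = np.linspace(0.0, 1.0, n)
y = h(x)
trap_est = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))

mc_err = abs(mc_est - exact)
trap_err = abs(trap_est - exact)
```

At equal budget n, the trapezoidal error is several orders of magnitude smaller, consistent with the O(1/n²) vs. O(1/√n) rates.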
8-9 Quadrature - Existing theoretical results
Key reference: Novak (1988)
- Sobolev space on [0, 1] (f^(s) square-integrable): minimax error decay O(n^{-s})
- Sobolev space on [0, 1]^d (all partial derivatives of total order s square-integrable): minimax error decay O(n^{-s/d})
- Sobolev space on the hypersphere in d + 1 dimensions: minimax error decay O(n^{-2s/d})
- A single result for all situations?
10 Kernels and reproducing kernel Hilbert spaces
- Input space X
- Positive definite kernel k : X × X → R
- Reproducing kernel Hilbert space (RKHS) F: space of functions f : X → R spanned by the Φ(x) = k(·, x), x ∈ X
- Reproducing properties: ⟨f, k(·, x)⟩_F = f(x) and ⟨k(·, y), k(·, x)⟩_F = k(x, y)
- Example: Sobolev spaces, e.g., k(x, y) = exp(−|x − y|)
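A minimal numerical check of positive definiteness (my example, not from the slides): the Gram matrix of the Sobolev-type kernel k(x, y) = exp(−|x − y|) at any set of points is symmetric positive semi-definite.

```python
import numpy as np

# Gram matrix of the kernel k(x, y) = exp(-|x - y|) at random points of
# [0, 1]; positive definiteness of k means all eigenvalues of K are
# nonnegative (up to numerical precision).
rng = np.random.default_rng(0)
x = rng.uniform(size=50)
K = np.exp(-np.abs(x[:, None] - x[None, :]))
eigvals = np.linalg.eigvalsh(K)   # eigenvalues of the symmetric matrix K
min_eig = eigvals.min()
```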
11-13 Quadrature in RKHS
Goal: given a distribution dρ on X and g ∈ L₂(dρ), estimate ∫_X h(x) g(x) dρ(x) by Σ_{j=1}^n α_j h(x_j) for any h ∈ F.
Using the reproducing property, this estimates ∫_X ⟨k(·, x), h⟩_F g(x) dρ(x) by Σ_{j=1}^n α_j ⟨k(·, x_j), h⟩_F, so that
  Error = ⟨h, ∫_X k(x, ·) g(x) dρ(x) − Σ_{j=1}^n α_j k(x_j, ·)⟩_F ≤ ‖h‖_F · ‖∫_X k(x, ·) g(x) dρ(x) − Σ_{j=1}^n α_j k(x_j, ·)‖_F
Worst-case bound (Smola et al., 2007): sup_{‖h‖_F ≤ 1} |∫_X h(x) g(x) dρ(x) − Σ_{j=1}^n α_j h(x_j)| is equal to ‖∫_X k(x, ·) g(x) dρ(x) − Σ_{j=1}^n α_j k(x_j, ·)‖_F
14-15 Quadrature in RKHS
Goal: find x_1, ..., x_n such that ‖∫_X k(x, ·) g(x) dρ(x) − Σ_{j=1}^n α_j k(x_j, ·)‖_F is as small as possible.
- Computation of the weights α ∈ R^n given x_i ∈ X, i = 1, ..., n:
  - Needs precise evaluations of μ(y) = ∫_X k(x, y) g(x) dρ(x) (Smola et al., 2007; Oates and Girolami, 2015)
  - Minimize −2 Σ_{i=1}^n μ(x_i) α_i + Σ_{i,j=1}^n α_i α_j k(x_i, x_j)
- Choice of the support points x_i:
  - Optimization (Chen et al., 2010; Bach et al., 2012)
  - Sampling
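The weight computation reduces to a linear system: minimizing −2 Σ μ(x_i) α_i + Σ α_i α_j k(x_i, x_j) gives Kα = μ. A sketch for one case where μ is available in closed form (my choice of kernel and measure, not from the slides): the Brownian kernel k(x, y) = min(x, y) on [0, 1], dρ uniform, g = 1, so μ(y) = y − y²/2.

```python
import numpy as np

# Kernel quadrature weights for k(x, y) = min(x, y) on [0, 1],
# dρ uniform, g = 1, so mu(y) = ∫_0^1 min(x, y) dx = y - y^2/2.
n = 20
x = np.arange(1, n + 1) / n                 # support points x_i
K = np.minimum(x[:, None], x[None, :])      # K_ij = k(x_i, x_j)
mu = x - x**2 / 2.0                         # mu evaluated at the x_i
alpha = np.linalg.solve(K, mu)              # minimizer of the quadratic

# The rule is exact on span{k(., x_i)}; h(x) = x equals k(x, 1) on [0, 1]
# and x_n = 1, so the estimate of ∫_0^1 x dx should be exactly 1/2.
estimate = alpha @ x
# Squared worst-case error ||mu - Σ α_i k(., x_i)||_F^2 = ||mu||_F^2 - α·mu,
# with ||mu||_F^2 = ∫∫ min(x, y) dx dy = 1/3.
sq_err = 1.0 / 3.0 - alpha @ mu
```

The residual sq_err is the squared worst-case error over the unit ball of F; with these n = 20 equispaced points it is already of order 1/n².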
17-18 Generic analysis of kernel quadrature rules
- Support points x_i sampled i.i.d. from a density q w.r.t. dρ
- Importance-weighted quadrature and error bound:
  ‖Σ_{i=1}^n (β_i / q(x_i)^{1/2}) k(·, x_i) − ∫_X k(·, x) g(x) dρ(x)‖²_F
- Robustness to noise: ‖β‖₂² small
- This is the approximation of a function μ = ∫_X k(·, x) g(x) dρ(x) by random elements from an RKHS
- Classical tool: eigenvalues of the covariance operator
- Mercer decomposition (Mercer, 1909): k(x, y) = Σ_{m≥1} μ_m e_m(x) e_m(y)
19-22 Upper bound
Assumptions:
- x_1, ..., x_n ∈ X sampled i.i.d. with density q(x) ∝ Σ_{m≥1} (μ_m / (μ_m + λ)) e_m(x)²
- Degrees of freedom d(λ) = Σ_{m≥1} μ_m / (μ_m + λ)
Bound: for any δ > 0, if n ≥ 5 d(λ) log(16 d(λ) / δ), then with probability greater than 1 − δ,
  sup_{‖g‖_{L₂(dρ)} ≤ 1} inf_{β ∈ R^n} ‖Σ_{i=1}^n (β_i / q(x_i)^{1/2}) k(·, x_i) − ∫_X k(·, x) g(x) dρ(x)‖²_F ≤ 4λ
- Proof technique (Bach, 2013; El Alaoui and Mahoney, 2014): explicit β obtained by regularizing by λ‖β‖₂²; concentration inequalities in Hilbert space
- Matching lower bound (for any family of x_i)
- Key features: (a) the degrees of freedom d(λ) and (b) the distribution q
23-24 Degrees of freedom
- Degrees of freedom d(λ) = Σ_{m≥1} μ_m / (μ_m + λ): a traditional quantity in the analysis of kernel methods (Hastie and Tibshirani, 1990)
- If the eigenvalues decay as μ_m ∝ m^{−α}, α > 1, then d(λ) ≈ #{m : μ_m ≥ λ} ≈ λ^{−1/α}
- Sobolev spaces in dimension d and of order s: α = 2s/d
- Take-home: need ≈ #{m : μ_m ≥ λ} features for squared error λ
- Take-home: after n sampled points, squared error ≈ μ_n
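The eigenvalue-decay take-home can be checked numerically. A sketch (my example, not from the slides) for μ_m = m^{−α} with α = 2: the products d(λ)·λ^{1/α} should stabilize as λ → 0.

```python
import numpy as np

# Degrees of freedom d(lambda) = sum_m mu_m / (mu_m + lambda) for the
# polynomial decay mu_m = m^(-alpha); the slide predicts d(lambda) of
# order lambda^(-1/alpha). The truncation at 10^6 terms is a numerical
# convenience (the tail is negligible for these lambdas).
alpha = 2.0
m = np.arange(1, 10**6 + 1, dtype=float)
mu = m**(-alpha)

def dof(lam):
    return float(np.sum(mu / (mu + lam)))

lams = [1e-2, 1e-3, 1e-4]
ratios = [dof(lam) * lam**(1.0 / alpha) for lam in lams]  # ~constant
```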
25-26 Optimized sampling distribution
- Density q(x) ∝ Σ_{m≥1} (μ_m / (μ_m + λ)) e_m(x)²
- Relationship to leverage scores (Mahoney, 2011)
- Hard to compute in generic situations; possible approximations (Drineas et al., 2012)
- For Sobolev spaces on [0, 1]^d or on the hypersphere: equal to the uniform distribution, and matches the known minimax rates
28-29 Link with random feature expansions
- Some kernels are naturally expressed as expectations:
  k(x, y) = E_v[φ(v, x) φ(v, y)] ≈ (1/n) Σ_{i=1}^n φ(v_i, x) φ(v_i, y)
- Neural networks with infinitely many random hidden units (Neal, 1995)
- Fourier features (Rahimi and Recht, 2007)
- Main question: minimal number n of features for a given approximation quality
- Kernel quadrature is a subcase of random feature expansions:
  - Mercer decomposition: k(x, y) = Σ_{m≥1} μ_m e_m(x) e_m(y)
  - φ(v, x) = Σ_{m≥1} μ_m^{1/2} e_m(x) e_m(v), with v ∈ X sampled from dρ
  - E_v[φ(v, x) φ(v, y)] = Σ_{m,m'≥1} (μ_m μ_{m'})^{1/2} e_m(x) e_{m'}(y) E[e_m(v) e_{m'}(v)] = k(x, y), by orthonormality of the e_m in L₂(dρ)
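A standard sketch of the Fourier-feature case (Rahimi and Recht, 2007; the bandwidth and sizes below are arbitrary choices of mine): for the Gaussian kernel, k(x, y) = E[2 cos(wx + b) cos(wy + b)] with w ~ N(0, 1) and b uniform on [0, 2π], so averaging over n sampled features approximates the kernel.

```python
import numpy as np

# Random Fourier features for the Gaussian kernel
# k(x, y) = exp(-(x - y)^2 / 2): phi(w, b, x) = sqrt(2) cos(w x + b),
# with w ~ N(0, 1), b ~ Uniform[0, 2*pi]; E[phi(x) phi(y)] = k(x, y).
rng = np.random.default_rng(0)
n_feat = 20000
w = rng.standard_normal(n_feat)
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)

def features(x):
    # one row per input point, one column per sampled feature
    return np.sqrt(2.0 / n_feat) * np.cos(np.outer(x, w) + b)

x = np.linspace(-2.0, 2.0, 25)
K_exact = np.exp(-0.5 * (x[:, None] - x[None, :])**2)
Phi = features(x)
K_approx = Phi @ Phi.T                  # Monte Carlo estimate of k
max_err = np.abs(K_exact - K_approx).max()
```

The entrywise error decays like O(1/√n_feat), which is the generic rate that the optimized sampling analysis above improves upon.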
30-32 Full function approximation
- Fact: given the support points x_i, the quadrature rule for ∫_X h(x) g(x) dρ(x) has weights which are linear in g, that is, α_i = ⟨a_i, g⟩_{L₂(dρ)}, so that
  ∫_X h(x) g(x) dρ(x) − Σ_{i=1}^n α_i h(x_i) = ⟨g, h − Σ_{i=1}^n h(x_i) a_i⟩_{L₂(dρ)}
- A uniform bound over ‖g‖_{L₂(dρ)} ≤ 1 is thus an approximation bound for h in L₂(dρ)
- Recovers the result from Novak (1988)
- Approximation in RKHS norm not possible
- Approximation in L∞-norm incurs a loss of performance of √n
- Adaptivity to smoother functions: if h is (a bit) smoother, the rate is still optimal
33 Sobolev spaces on [0, 1]
Quadrature rule obtained from a kernel of order t, applied to test functions of order s.
[Figure: four log-log plots of squared error vs. n, one per test-function smoothness s = 1, ..., 4; the legends report the observed slopes:
  s = 1: t=1: 2.0, t=2: 2.5, t=3: 3.3, t=4: 5.2
  s = 2: t=1: 3.9, t=2: 3.9, t=3: 4.0
  s = 3: t=1: 4.3, t=2: 5.8, t=3: 5.8
  s = 4: t=1: 4.2, t=3: 7.5]
34-35 Herding
- Choosing the points through optimization (Chen et al., 2010)
- Interpretation as Frank-Wolfe optimization (Bach, Lacoste-Julien, and Obozinski, 2012)
- Convex weights α
- Extra projection step (Briol et al., 2015)
- Open problem: no true convergence rate (Bach et al., 2012)
  - In finite dimension: exponential convergence rate, depending on the existence of a certain constant c > 0
  - In infinite dimension: the constant c is provably equal to zero (an exponential rate would contradict the lower bounds)
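A minimal herding sketch on a grid discretization of [0, 1] (grid, kernel, and bandwidth are my choices, not from the slides): with uniform weights 1/n, each Frank-Wolfe step greedily picks the point maximizing μ(x) − (1/(t+1)) Σ_{j≤t} k(x, x_j), and the worst-case quadrature error should decrease with n.

```python
import numpy as np

# Kernel herding (Chen et al., 2010) with uniform weights 1/n on a grid
# approximation of dρ = uniform on [0, 1], Gaussian kernel, bandwidth 0.1.
grid = np.linspace(0.0, 1.0, 401)
K = np.exp(-0.5 * ((grid[:, None] - grid[None, :]) / 0.1) ** 2)
mu = K.mean(axis=1)                 # kernel embedding of the uniform measure

selected = []
for t in range(20):
    # greedy Frank-Wolfe step over the grid of candidates
    penalty = K[:, selected].sum(axis=1) / (t + 1) if selected else 0.0
    selected.append(int(np.argmax(mu - penalty)))

def mmd2(idx):
    # squared worst-case error ||mu - (1/|idx|) sum_i k(., x_i)||_F^2
    a = np.full(len(idx), 1.0 / len(idx))
    return float(mu.mean() - 2.0 * a @ mu[idx] + a @ K[np.ix_(idx, idx)] @ a)

err_after_1 = mmd2(selected[:1])
err_after_20 = mmd2(selected)
```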
36-37 Conclusion
- Sharp analysis of kernel quadrature rules:
  - Spectrum (μ_m) of the covariance operator
  - n points sampled i.i.d. from a well-chosen distribution
  - Error of order μ_n
  - Applies to all X with a positive-definite kernel
- Extensions:
  - Computationally efficient ways to sample from the optimized distribution (Drineas et al., 2012)
  - Anytime sampling
  - From quadrature to maximization (Novak, 1988)
  - Quasi-random sampling (Yang et al., 2014; Oates and Girolami, 2015)
38-39 References
- F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the International Conference on Learning Theory (COLT), 2013.
- F. Bach, S. Lacoste-Julien, and G. Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint, 2012.
- F.-X. Briol, C. J. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Adv. NIPS, 2015.
- Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proc. UAI, 2010.
- D. Cruz-Uribe and C. J. Neugebauer. Sharp error bounds for the trapezoidal rule and Simpson's rule. Journal of Inequalities in Pure and Applied Mathematics, 3(4), 2002.
- P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13, 2012.
- A. El Alaoui and M. W. Mahoney. Fast randomized kernel methods with statistical guarantees. Technical report, arXiv, 2014.
- T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
- F. B. Hildebrand. Introduction to Numerical Analysis. Courier Dover Publications, 1987.
- M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2), 2011.
- J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, Series A, 209, 1909.
- W. J. Morokoff and R. E. Caflisch. Quasi-random sequences and their discrepancies. SIAM Journal on Scientific Computing, 15(6), 1994.
- R. M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
- E. Novak. Deterministic and Stochastic Error Bounds in Numerical Analysis. Springer-Verlag, 1988.
- C. J. Oates and M. Girolami. Variance reduction for quasi-Monte-Carlo. Technical report, arXiv, 2015.
- A. O'Hagan. Bayes-Hermite quadrature. Journal of Statistical Planning and Inference, 29(3), 1991.
- A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.
- A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory. Springer, 2007.
- J. Yang, V. Sindhwani, H. Avron, and M. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of the International Conference on Machine Learning (ICML), 2014.
More informationMTTTS16 Learning from Multiple Sources
MTTTS16 Learning from Multiple Sources 5 ECTS credits Autumn 2018, University of Tampere Lecturer: Jaakko Peltonen Lecture 6: Multitask learning with kernel methods and nonparametric models On this lecture:
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationKernel Methods. Outline
Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationNeutron inverse kinetics via Gaussian Processes
Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationProbabilistic numerics for deep learning
Presenter: Shijia Wang Department of Engineering Science, University of Oxford rning (RLSS) Summer School, Montreal 2017 Outline 1 Introduction Probabilistic Numerics 2 Components Probabilistic modeling
More informationMathematical Methods for Data Analysis
Mathematical Methods for Data Analysis Massimiliano Pontil Istituto Italiano di Tecnologia and Department of Computer Science University College London Massimiliano Pontil Mathematical Methods for Data
More informationConnections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN
Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More information1. Mathematical Foundations of Machine Learning
1. Mathematical Foundations of Machine Learning 1.1 Basic Concepts Definition of Learning Definition [Mitchell (1997)] A computer program is said to learn from experience E with respect to some class of
More informationCan we do statistical inference in a non-asymptotic way? 1
Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.
More informationSketching as a Tool for Numerical Linear Algebra
Sketching as a Tool for Numerical Linear Algebra (Part 2) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching
More informationConvergence of Eigenspaces in Kernel Principal Component Analysis
Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More informationElements of Positive Definite Kernel and Reproducing Kernel Hilbert Space
Elements of Positive Definite Kernel and Reproducing Kernel Hilbert Space Statistical Inference with Reproducing Kernel Hilbert Space Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationConvergence of the Ensemble Kalman Filter in Hilbert Space
Convergence of the Ensemble Kalman Filter in Hilbert Space Jan Mandel Center for Computational Mathematics Department of Mathematical and Statistical Sciences University of Colorado Denver Parts based
More informationMemory Efficient Kernel Approximation
Si Si Department of Computer Science University of Texas at Austin ICML Beijing, China June 23, 2014 Joint work with Cho-Jui Hsieh and Inderjit S. Dhillon Outline Background Motivation Low-Rank vs. Block
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationLinear Regression and Discrimination
Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian
More informationStein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu Dilin Wang Department of Computer Science Dartmouth College Hanover, NH 03755 {qiang.liu, dilin.wang.gr}@dartmouth.edu
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London
More informationMachine Learning and Related Disciplines
Machine Learning and Related Disciplines The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 8-12 (Mon.-Fri.) Yung-Kyun Noh Machine Learning Interdisciplinary
More informationPolynomial interpolation on the sphere, reproducing kernels and random matrices
Polynomial interpolation on the sphere, reproducing kernels and random matrices Paul Leopardi Mathematical Sciences Institute, Australian National University. For presentation at MASCOS Workshop on Stochastics
More informationDeep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści
Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More information