Distribution Regression: A Simple Technique with Minimax-optimal Guarantee
Zoltán Szabó (CMAP, École Polytechnique). Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), and Arthur Gretton (Gatsby Unit, UCL). Parisian Statistics Seminar, March 27, 2017.
Example: sustainability
Goal: aerosol prediction (the link between air pollution and climate).
Prediction using labelled bags: bag := multi-spectral satellite measurements over an area; label := local aerosol value.
Example: existing methods
Multi-instance learning [Haussler, 1999; Gärtner et al., 2002] (set kernel).
Sensible methods in regression are few: (1) they impose restrictive technical conditions, and (2) super-high-resolution satellite images would be needed.
One-page summary
Contributions:
1. Practical: state-of-the-art accuracy (aerosol prediction).
2. Theoretical: general bags (graphs, time series, texts, ...); consistency of the set kernel in regression (a 17-year-old open problem); how many samples per bag are needed?
Objects in the bags
Examples: time-series modelling: user = set of time series; computer vision: image = collection of patch vectors; NLP: corpus = bag of documents; network analysis: group of people = bag of friendship graphs; ...
Wider context (statistics): point estimation tasks.
Regression on labelled bags
Given labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^l$, where $\hat{P}_i$ is a bag of $N := |\hat{P}_i|$ samples from $P_i$, and a test bag $\hat{P}$.
Estimator:
$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{l} \sum_{i=1}^l \big[ f(\mu_{\hat{P}_i}) - y_i \big]^2 + \lambda \|f\|_H^2,$$
where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.
Prediction: $\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Challenges:
1. Inner product of distributions: $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = ?$
2. How many samples per bag?
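As an illustration, the prediction formula above can be sketched in a few lines of Python (a toy setup, not the authors' ITE implementation: the inner kernel $k$ is Gaussian, the outer kernel is the plain set kernel $K(\mu_P, \mu_Q) = \langle \mu_P, \mu_Q \rangle$, and the bag data and labels are made up):

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    # inner kernel k(a,b) on points of the domain, all pairs at once
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d**2, axis=2) / (2 * sigma**2))

def set_kernel(A, B, sigma=1.0):
    # K(A,B) = (1/(|A||B|)) sum_ij k(a_i, b_j) = <mu_A, mu_B>_H
    return gauss_k(A, B, sigma).mean()

def fit_predict(bags, y, test_bags, lam=1e-3, sigma=1.0):
    # ridge regression on mean embeddings: yhat = g^T (G + l*lam*I)^{-1} y
    l = len(bags)
    G = np.array([[set_kernel(A, B, sigma) for B in bags] for A in bags])
    alpha = np.linalg.solve(G + l * lam * np.eye(l), y)
    g = np.array([[set_kernel(T, B, sigma) for B in bags] for T in test_bags])
    return g @ alpha

# toy problem: the label of a bag is the mean of its underlying Gaussian
rng = np.random.default_rng(0)
means = rng.uniform(-1, 1, size=30)
bags = [rng.normal(m, 0.1, size=(50, 1)) for m in means]
y = means
test_bags = [rng.normal(0.5, 0.1, size=(50, 1))]
print(fit_predict(bags, y, test_bags))  # prediction for a bag drawn around 0.5
```

Note that the Gram matrix $G$ is built once from the $l$ training bags, so a new test bag only costs one row of set-kernel evaluations.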
Regression on labelled bags: similarity
Let us define an inner product on distributions, $K(P, Q)$:
1. Set kernel: for bags $A = \{a_i\}_{i=1}^N$, $B = \{b_j\}_{j=1}^N$,
$$K(A, B) = \frac{1}{N^2} \sum_{i,j=1}^N k(a_i, b_j) = \Big\langle \underbrace{\frac{1}{N} \sum_{i=1}^N \varphi(a_i)}_{\text{feature of bag } A}, \frac{1}{N} \sum_{j=1}^N \varphi(b_j) \Big\rangle.$$
2. Taking the limit [Berlinet and Thomas-Agnan, 2004; Altun and Smola, 2006; Smola et al., 2007]: for $a \sim P$, $b \sim Q$,
$$K(P, Q) = \mathbb{E}_{a,b}\, k(a, b) = \Big\langle \underbrace{\mathbb{E}_a \varphi(a)}_{=: \mu_P, \text{ feature of } P}, \mathbb{E}_b \varphi(b) \Big\rangle.$$
Example (Gaussian kernel): $k(a, b) = e^{-\|a-b\|_2^2 / (2\sigma^2)}$.
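The equality of the double-sum form and the inner-product-of-bag-features form can be checked numerically with a finite-dimensional feature map. The sketch below uses the linear kernel $k(a,b) = \langle a, b \rangle$, whose feature map $\varphi$ is the identity, so the feature of a bag is simply its sample mean (for the Gaussian kernel $\varphi$ is infinite-dimensional and this shortcut is unavailable):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))   # bag of 6 points in R^3
B = rng.normal(size=(9, 3))   # bag of 9 points in R^3

# double-sum form: (1/(N_A N_B)) sum_ij k(a_i, b_j) with linear k(a,b) = <a,b>
double_sum = sum(a @ b for a in A for b in B) / (len(A) * len(B))

# feature form: <mu_A, mu_B>, where mu is the bag mean (phi = identity)
feature_form = A.mean(axis=0) @ B.mean(axis=0)

print(abs(double_sum - feature_form))  # ≈ 0: the two forms agree
```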
RKHS definition(s)
Given a set $D$. Kernel: $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_F$, where $F$ is a Hilbert space.
RKHS: a Hilbert space $H \subseteq \mathbb{R}^D$ in which the evaluation functional $\delta_b(f) = f(b)$ is continuous for all $b$.
Reproducing kernel of a Hilbert space $H \subseteq \mathbb{R}^D$:
1. $k(\cdot, b) \in H$,
2. $\langle f, k(\cdot, b) \rangle_H = f(b)$.
Note: $k(a, b) = \langle k(\cdot, a), k(\cdot, b) \rangle_H$.
A symmetric $k : D \times D \to \mathbb{R}$ is positive definite if $G = [k(x_i, x_j)]_{i,j=1}^n \succeq 0$ for all $n$ and all $x_i$.
Kernel examples on $D = \mathbb{R}^d$, $\theta > 0$:
$k_G(a, b) = e^{-\|a-b\|_2^2 / (2\theta^2)}$ (Gaussian),
$k_e(a, b) = e^{-\|a-b\|_2 / (2\theta^2)}$ (exponential),
$k_C(a, b) = \big(1 + \|a-b\|_2^2 / \theta^2\big)^{-1}$ (Cauchy),
$k_t(a, b) = \big(1 + \|a-b\|_2^{\theta}\big)^{-1}$ (Student),
$k_p(a, b) = (\langle a, b \rangle + \theta)^p$ (polynomial),
$k_r(a, b) = 1 - \frac{\|a-b\|_2^2}{\|a-b\|_2^2 + \theta}$ (rational quadratic),
$k_i(a, b) = \big(\|a-b\|_2^2 + \theta^2\big)^{-1/2}$ (inverse multiquadric),
$k_{M,3/2}(a, b) = \Big(1 + \frac{\sqrt{3}\|a-b\|_2}{\theta}\Big) e^{-\sqrt{3}\|a-b\|_2 / \theta}$ (Matérn, $\nu = 3/2$),
$k_{M,5/2}(a, b) = \Big(1 + \frac{\sqrt{5}\|a-b\|_2}{\theta} + \frac{5\|a-b\|_2^2}{3\theta^2}\Big) e^{-\sqrt{5}\|a-b\|_2 / \theta}$ (Matérn, $\nu = 5/2$).
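A few of the kernels above written out as code (a minimal sketch; vector inputs, bandwidth $\theta > 0$ as in the formulas):

```python
import numpy as np

def k_gauss(a, b, theta=1.0):
    # Gaussian: exp(-||a-b||^2 / (2 theta^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * theta**2))

def k_cauchy(a, b, theta=1.0):
    # Cauchy: (1 + ||a-b||^2 / theta^2)^{-1}
    return 1.0 / (1.0 + np.sum((a - b) ** 2) / theta**2)

def k_invmq(a, b, theta=1.0):
    # inverse multiquadric: (||a-b||^2 + theta^2)^{-1/2}
    return 1.0 / np.sqrt(np.sum((a - b) ** 2) + theta**2)

def k_matern32(a, b, theta=1.0):
    # Matern with nu = 3/2
    r = np.linalg.norm(a - b)
    return (1 + np.sqrt(3) * r / theta) * np.exp(-np.sqrt(3) * r / theta)

a, b = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(k_gauss(a, a), k_matern32(a, a))  # 1.0 1.0 (both equal 1 at r = 0)
```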
Regression on labelled bags: baseline
Quality of an estimator, baseline: $R(f) = \mathbb{E}_{(\mu_P, y) \sim \rho} [f(\mu_P) - y]^2$; $f_\rho$ = best regressor.
How many samples per bag are needed to achieve the accuracy of $f_\rho$? Is this possible at all?
Assume (for a moment): $f_\rho \in H(K)$.
Our result: how many samples per bag
Known [Caponnetto and De Vito, 2007]: best/realized rate
$$R(f_z^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{bc}{bc+1}}\big),$$
where $b$ captures the size of the input space and $c$ the smoothness of $f_\rho$.
Let $N = \tilde{O}(l^a)$, with $N$ the size of the bags and $l$ the number of bags.
Our result: if $2 \le a$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate. In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
Consequence: regression with the set kernel is consistent.
Well-specified case: computational & statistical tradeoff
Let $N = \tilde{O}(l^a)$. Our result:
If $\frac{b(c+1)}{bc+1} \le a$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{bc}{bc+1}}\big)$.
If $a \le \frac{b(c+1)}{bc+1}$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{ac}{c+1}}\big)$.
Meaning: smaller $a$ brings computational savings, but reduced statistical efficiency.
$c \mapsto \frac{b(c+1)}{bc+1}$ is decreasing: easier problems allow smaller bags.
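The two regimes combine into a single excess-risk exponent as a function of $a$; a small helper makes the tradeoff concrete (the parameter values below are purely illustrative):

```python
# excess-risk exponent e(a) implied by the two regimes:
# R(f) - R(f_rho) = O(l^{-e(a)}), with threshold a* = b(c+1)/(bc+1)
def excess_risk_exponent(a, b, c):
    threshold = b * (c + 1) / (b * c + 1)
    if a >= threshold:
        return b * c / (b * c + 1)  # best achievable rate, no further gain
    return a * c / (c + 1)          # reduced rate: cheaper (smaller) bags

# example: b = 1, c = 1 gives threshold a* = 1 and best exponent 1/2
print(excess_risk_exponent(0.5, 1, 1), excess_risk_exponent(2.0, 1, 1))  # 0.25 0.5
```

The two branches agree exactly at the threshold $a = \frac{b(c+1)}{bc+1}$, so the exponent is continuous in $a$.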
Why can we get consistency/rates? Intuition
Convergence of the mean embedding: $\|\mu_P - \mu_{\hat{P}}\|_H = O\big(\frac{1}{\sqrt{N}}\big)$.
Hölder property of $K$ ($0 < L$, $0 < h \le 1$): $\|K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}})\|_H \le L \|\mu_P - \mu_{\hat{P}}\|_H^h$.
Hence $f_{\hat{z}}^{\lambda}$ depends nicely on $\mu_{\hat{P}}$.
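The $O(1/\sqrt{N})$ convergence can be checked by simulation in the simplest case, the linear kernel, where the mean embedding of $\hat{P}$ is just the sample mean (a Monte Carlo sketch, not part of the original slides):

```python
import numpy as np

# check ||mu_P - mu_Phat||_H = O(1/sqrt(N)) for the linear kernel,
# where the embedding of P = N(0, I_3) is the zero vector
rng = np.random.default_rng(0)

def embedding_error(N, trials=200):
    # average distance between the empirical embedding (sample mean)
    # and the true embedding (zero), over independent bags of size N
    samples = rng.normal(size=(trials, N, 3))
    return np.mean(np.linalg.norm(samples.mean(axis=1), axis=1))

e100, e10000 = embedding_error(100), embedding_error(10000)
print(e10000 / e100)  # close to sqrt(100/10000) = 0.1
```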
Valid similarities
Recall: $K(P, Q) = \langle \mu_P, \mu_Q \rangle$. Further valid choices:
$K_G(P, Q) = e^{-\|\mu_P - \mu_Q\|^2 / (2\theta^2)}$,
$K_e(P, Q) = e^{-\|\mu_P - \mu_Q\| / (2\theta^2)}$,
$K_C(P, Q) = \big(1 + \|\mu_P - \mu_Q\|^2 / \theta^2\big)^{-1}$,
$K_t(P, Q) = \big(1 + \|\mu_P - \mu_Q\|^{\theta}\big)^{-1}$,
$K_i(P, Q) = \big(\|\mu_P - \mu_Q\|^2 + \theta^2\big)^{-1/2}$.
These are functions of $\|\mu_P - \mu_Q\|$; their computation is similar to that of the set kernel.
Extensions
1. Misspecified setting ($f_\rho \in L^2 \setminus H$): consistency, i.e. convergence to $\inf_{f \in H} \|f - f_\rho\|_{L^2}$; under smoothness assumptions on $f_\rho$: a computational & statistical tradeoff.
2. Vector-valued output: $Y$ a separable Hilbert space, $K(\mu_P, \mu_Q) \in L(Y)$. Prediction on a test bag $\hat{P}$: $\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$. Specifically: $Y = \mathbb{R} \Rightarrow L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow L(Y) = \mathbb{R}^{d \times d}$.
Misspecified case: consistency
Our result: let $N = \tilde{O}(l)$, $l \to \infty$, $\lambda \to 0$, $\lambda l \to \infty$. Then
$$R\big(f_{\hat{z}}^{\lambda}\big) - R(f_\rho) \to \inf_{f \in H} \|f - f_\rho\|_{L^2}.$$
Misspecified case: $s$-smooth $f_\rho$
Let $N = \tilde{O}(l^{2a})$ and let $f_\rho$ be $s$-smooth, $s \in (0, 1]$.
Our result (computational & statistical tradeoff):
If $\frac{s+1}{s+2} \le a$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{2s}{s+2}}\big)$.
If $a \le \frac{s+1}{s+2}$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{2sa}{s+1}}\big)$.
Meaning: smaller $a$ brings computational savings, but reduced statistical efficiency.
Sensible choice: $a = \frac{s+1}{s+2}$, so $2a \le \frac{4}{3} < 2$!
$s \mapsto \frac{2s}{s+2}$ is increasing: easier task = better rate. As $s \to 0$ ($f_\rho \in L^2$ only): arbitrarily slow rate. At $s = 1$: $O(l^{-2/3})$ speed.
Misspecified case: optimality
Our rate: $r(l) = l^{-\frac{2s}{s+2}}$. One-stage sampled optimal rate: $r_o(l) = l^{-\frac{2s}{2s+1}}$ [Steinwart et al., 2009], under $s$-smoothness + an eigendecay constraint, $D$ compact metric, $Y = \mathbb{R}$.
[Figure: the rate exponents $2s/(2s+1)$ (one-stage optimal) and $2s/(s+2)$ (ours) plotted as functions of the smoothness $s$.]
$s$-smoothness: intuition
Assumption: $f_\rho \in \mathrm{Im}(C^s)$, $s \in (0, 1]$, where $C$ is the uncentered covariance operator.
Imagine $C \in \mathbb{R}^{d \times d}$ is a symmetric matrix: $C = U \Lambda U^T$, so $Cv = \sum_{n=1}^d \lambda_n \langle u_n, v \rangle u_n$.
General $C$:
$$C(v) = \sum_n \lambda_n \langle u_n, v \rangle u_n, \qquad C^s(v) = \sum_n \lambda_n^s \langle u_n, v \rangle u_n,$$
$$\mathrm{Im}(C^s) = \Big\{ \sum_n c_n u_n : \sum_n c_n^2 \lambda_n^{-2s} < \infty \Big\}.$$
Larger $s$ means faster decay of the Fourier coefficients $c_n$.
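The finite-dimensional picture can be reproduced directly: $C^s$ via the eigendecomposition $C = U \Lambda U^T$ (a numpy sketch; the data matrix is made up):

```python
import numpy as np

def frac_power(C, s):
    # C^s = U diag(lambda^s) U^T for symmetric PSD C
    lam, U = np.linalg.eigh(C)
    return (U * lam**s) @ U.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
C = X.T @ X / 50                    # uncentered covariance, PSD

half = frac_power(C, 0.5)
print(np.allclose(half @ half, C))  # True: (C^{1/2})^2 recovers C
```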
Aerosol prediction result ($100 \times$ RMSE)
We perform on par with the state-of-the-art, hand-engineered method.
Baseline (hand-crafted features): Zhuang Wang, Liang Lan, Slobodan Vucetic. IEEE Transactions on Geoscience and Remote Sensing, 2012.
Our prediction accuracy: 7.81 ($\pm$1.64), with no expert knowledge.
Code in the ITE toolbox (#2 on mloss).
Summary
Problem: distribution regression.
Contribution: a computational & statistical tradeoff analysis; the set kernel yields a simple algorithm with a minimax-optimal rate.
Zoltán Szabó, Bharath K. Sriperumbudur, Barnabás Póczos, Arthur Gretton. Learning Theory for Distribution Regression. Journal of Machine Learning Research, 17(152):1-40, 2016.
Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation and by NSF grants IIS and IIS. Part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.
References
Altun, Y. and Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In Conference on Learning Theory (COLT).
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML).
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz.
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT).
Steinwart, I., Hush, D. R., and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Conference on Learning Theory (COLT).
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationLearnability of Gaussians with flexible variances
Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007
More informationJoint Emotion Analysis via Multi-task Gaussian Processes
Joint Emotion Analysis via Multi-task Gaussian Processes Daniel Beck, Trevor Cohn, Lucia Specia October 28, 2014 1 Introduction 2 Multi-task Gaussian Process Regression 3 Experiments and Discussion 4 Conclusions
More informationEECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels
EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationKernel Ridge Regression. Mohammad Emtiyaz Khan EPFL Oct 27, 2015
Kernel Ridge Regression Mohammad Emtiyaz Khan EPFL Oct 27, 2015 Mohammad Emtiyaz Khan 2015 Motivation The ridge solution β R D has a counterpart α R N. Using duality, we will establish a relationship between
More informationConvergence rates of spectral methods for statistical inverse learning problems
Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationBeyond the Point Cloud: From Transductive to Semi-Supervised Learning
Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of
More informationManifold Regularization
Manifold Regularization Vikas Sindhwani Department of Computer Science University of Chicago Joint Work with Mikhail Belkin and Partha Niyogi TTI-C Talk September 14, 24 p.1 The Problem of Learning is
More informationThe Learning Problem and Regularization
9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning
More informationOn learning with kernels for unordered pairs
Martial Hue MartialHue@mines-paristechfr Jean-Philippe Vert Jean-PhilippeVert@mines-paristechfr Mines ParisTech, Centre for Computational Biology, 35 rue Saint Honoré, F-77300 Fontainebleau, France Institut
More informationA Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes
A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes Ürün Dogan 1 Tobias Glasmachers 2 and Christian Igel 3 1 Institut für Mathematik Universität Potsdam Germany
More informationInverse Density as an Inverse Problem: the Fredholm Equation Approach
Inverse Density as an Inverse Problem: the Fredholm Equation Approach Qichao Que, Mikhail Belkin Department of Computer Science and Engineering The Ohio State University {que,mbelkin}@cse.ohio-state.edu
More informationKernel Methods. Outline
Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More information26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.
10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with
More informationDiffeomorphic Warping. Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel)
Diffeomorphic Warping Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel) What Manifold Learning Isn t Common features of Manifold Learning Algorithms: 1-1 charting Dense sampling Geometric Assumptions
More informationOptimal kernel methods for large scale learning
Optimal kernel methods for large scale learning Alessandro Rudi INRIA - École Normale Supérieure, Paris joint work with Luigi Carratino, Lorenzo Rosasco 6 Mar 2018 École Polytechnique Learning problem
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More information