Learnability of Gaussians with flexible variances

Ding-Xuan Zhou, City University of Hong Kong
E-mail: mazhou@cityu.edu.hk
Supported in part by the Research Grants Council of Hong Kong
October 20, 2007

Least-squares Regularized Regression

Learn $f : X \to Y$ from random samples $z = \{(x_i, y_i)\}_{i=1}^m$. Take $X$ to be a compact subset of $\mathbb{R}^n$ and $Y = \mathbb{R}$, with $y \approx f(x)$. Due to noise or other uncertainty, we assume an (unknown) probability measure $\rho$ on $Z = X \times Y$ governs the sampling: a marginal distribution $\rho_X$ on $X$, according to which $\{x_i\}_{i=1}^m$ are drawn, and a conditional distribution $\rho(\cdot \mid x)$ at each $x \in X$.

Learning the regression function: $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$, so that $y_i \approx f_\rho(x_i)$.
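A toy simulation of this sampling model (the choice of $f_\rho$, the uniform marginal, and the Gaussian noise are illustrative assumptions, not part of the talk):

```python
import numpy as np

# A toy instance of the sampling model: X = [0, 1] (a compact subset of R), Y = R.
# rho is specified by the marginal rho_X (uniform here) and a conditional rho(.|x)
# that is Gaussian with mean f_rho(x), so f_rho below really is the regression function.
rng = np.random.default_rng(0)

def f_rho(x):
    # an arbitrary smooth regression function, chosen only for illustration
    return np.sin(2 * np.pi * x)

m = 200
x = rng.random(m)                                # x_i drawn i.i.d. from rho_X
y = f_rho(x) + 0.1 * rng.standard_normal(m)      # y_i drawn from rho(.|x_i)
z = list(zip(x, y))                              # the sample z = {(x_i, y_i)}_{i=1}^m
```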

Learning with a Fixed Gaussian

$$f_{z,\lambda,\sigma} := \arg\min_{f \in H_{K_\sigma}} \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_\sigma}^2, \qquad (1)$$

where $\lambda = \lambda(m) > 0$ and $K_\sigma(x, y) = e^{-\frac{|x - y|^2}{2\sigma^2}}$ is a Gaussian kernel on $X$.

Reproducing Kernel Hilbert Space (RKHS) $H_{K_\sigma}$: the completion of $\mathrm{span}\{(K_\sigma)_t := K_\sigma(t, \cdot) : t \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_{K_\sigma}$ satisfying $\langle (K_\sigma)_x, (K_\sigma)_y \rangle_{K_\sigma} = K_\sigma(x, y)$.
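A minimal sketch of how scheme (1) is solved in practice, assuming the standard representer theorem (not stated on the slide): the minimizer has the form $f_{z,\lambda,\sigma} = \sum_{j=1}^m \alpha_j K_\sigma(x_j, \cdot)$ with $(K + \lambda m I)\alpha = y$, where $K$ is the Gram matrix. The data and parameter values below are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix K_sigma(a, b) = exp(-|a - b|^2 / (2 sigma^2)) for rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def fit_fixed_gaussian(X, y, sigma, lam):
    """Minimize (1/m) sum (f(x_i) - y_i)^2 + lam * ||f||_{K_sigma}^2 over H_{K_sigma}.
    Representer form: f = sum_j alpha_j K_sigma(x_j, .), with (K + lam*m*I) alpha = y."""
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(X_train, alpha, sigma, X_test):
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# illustrative use on synthetic data
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)
alpha = fit_fixed_gaussian(X, y, sigma=0.2, lam=1e-3)
```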

Theorem 1 (Smale-Zhou, Constr. Approx. 2007). Assume $|y| \leq M$ and that $f_\rho(x) = \int_X K_\sigma(x, y) g(y) \, d\rho_X(y)$ for some $g \in L^2_{\rho_X}$. For any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\|f_{z,\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}} \leq 2 \log(4/\delta) \, (12M)^{2/3} \, \|g\|_{L^2_{\rho_X}}^{1/3} \Big( \frac{1}{m} \Big)^{1/3},$$

where $\lambda = \lambda(m) = \log(4/\delta) \big( 12M / \|g\|_{L^2_{\rho_X}} \big)^{2/3} (1/m)^{1/3}$.

In Theorem 1, $f_\rho \in C^\infty$.

RKHS $H_{K_\sigma}$ generated by a Gaussian kernel on $X$:

$$H_{K_\sigma} = H_{K_\sigma}(\mathbb{R}^n)|_X,$$

where $H_{K_\sigma}(\mathbb{R}^n)$ is the RKHS generated by $K_\sigma$ as a Mercer kernel on $\mathbb{R}^n$:

$$H_{K_\sigma}(\mathbb{R}^n) = \big\{ f \in L^2(\mathbb{R}^n) : \|f\|_{H_{K_\sigma}(\mathbb{R}^n)} < \infty \big\},$$

where

$$\|f\|_{H_{K_\sigma}(\mathbb{R}^n)} = \Big( (\sqrt{2\pi}\,\sigma)^{-n} \int_{\mathbb{R}^n} |\hat f(\xi)|^2 \, e^{\frac{\sigma^2 |\xi|^2}{2}} \, d\xi \Big)^{1/2}.$$

Thus $H_{K_\sigma}(\mathbb{R}^n) \subset C^\infty(\mathbb{R}^n)$ (Steinwart).
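A small numerical sanity check of this norm formula (an illustration added here, not from the talk), for $n = 1$ and $f = (K_\sigma)_0$: the reproducing property forces $\|(K_\sigma)_0\|_{K_\sigma}^2 = K_\sigma(0,0) = 1$, and the Fourier-side integral reproduces this value. The closed-form Fourier transform of the Gaussian is assumed.

```python
import numpy as np

# Check the Fourier characterization for n = 1 and f = (K_sigma)_0,
# i.e. f(x) = exp(-x^2 / (2 sigma^2)) on R. With the convention
# fhat(xi) = (2 pi)^(-1/2) * int f(x) e^{-i x xi} dx, a standard Gaussian
# integral gives fhat(xi) = sigma * exp(-sigma^2 xi^2 / 2).
sigma = 0.7
xi, dxi = np.linspace(-10 / sigma, 10 / sigma, 200001, retstep=True)
fhat_sq = (sigma * np.exp(-sigma**2 * xi**2 / 2)) ** 2

# ||f||^2_{H_{K_sigma}(R)} = (sqrt(2 pi) sigma)^(-1) * int |fhat|^2 e^{sigma^2 xi^2 / 2} dxi
norm_sq = np.sum(fhat_sq * np.exp(sigma**2 * xi**2 / 2)) * dxi / (np.sqrt(2 * np.pi) * sigma)
print(norm_sq)  # ~ 1.0 = K_sigma(0, 0), as the reproducing property requires
```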

If $X$ is a domain with piecewise smooth boundary and $d\rho_X(x) \geq c_0 \, dx$ for some $c_0 > 0$, then for any $\beta > 0$,

$$D_\sigma(\lambda) := \inf_{f \in H_{K_\sigma}} \Big\{ \|f - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda \|f\|_{K_\sigma}^2 \Big\} = O(\lambda^\beta)$$

implies $f_\rho \in C^\infty(X)$.

Note $\|f - f_\rho\|_{L^2_{\rho_X}}^2 = \mathcal{E}(f) - \mathcal{E}(f_\rho)$, where $\mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho$. Denote $\mathcal{E}_z(f) = \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 \approx \mathcal{E}(f)$. Then

$$f_{z,\lambda,\sigma} = \arg\min_{f \in H_{K_\sigma}} \big\{ \mathcal{E}_z(f) + \lambda \|f\|_{K_\sigma}^2 \big\}.$$

If we define $f_{\lambda,\sigma} = \arg\min_{f \in H_{K_\sigma}} \{ \mathcal{E}(f) + \lambda \|f\|_{K_\sigma}^2 \}$, then $f_{z,\lambda,\sigma} \approx f_{\lambda,\sigma}$, and the error can be estimated in terms of $\lambda$ by the theory of uniform convergence over the compact function set

$$B_{M/\sqrt{\lambda}} := \{ f \in H_{K_\sigma} : \|f\|_{K_\sigma} \leq M/\sqrt{\lambda} \},$$

since $f_{z,\lambda,\sigma} \in B_{M/\sqrt{\lambda}}$. But $\|f_{\lambda,\sigma} - f_\rho\|_{L^2_{\rho_X}}^2 = O(\lambda^\beta)$ for any $\beta > 0$ implies $f_\rho \in C^\infty(X)$. So the learning ability of a single Gaussian is weak.

One may choose a less smooth kernel, but we would like radial basis kernels for manifold learning.

One way to increase the learning ability of Gaussian kernels: let $\sigma$ depend on $m$, with $\sigma = \sigma(m) \to 0$ as $m \to \infty$ (Steinwart-Scovel, Xiang-Zhou, ...).

Another way: allow all possible variances $\sigma \in (0, \infty)$.

Regularization Schemes with Flexible Gaussians (Zhou, Wu-Ying-Zhou, Ying-Zhou, Micchelli-Pontil-Wu-Zhou, ...):

$$f_{z,\lambda} := \arg\min_{0 < \sigma < \infty} \ \min_{f \in H_{K_\sigma}} \ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_\sigma}^2.$$

Theorem 2 (Ying-Zhou, J. Mach. Learning Res. 2007). Let $\rho_X$ be the Lebesgue measure on a domain $X$ in $\mathbb{R}^n$ with minimally smooth boundary. If $f_\rho \in H^s(X)$ for some $s \geq 2$ and $\lambda = m^{-\frac{2s+n}{4(4s+n)}}$, then we have

$$E_{z \in Z^m} \big( \|f_{z,\lambda} - f_\rho\|_{L^2_{\rho_X}}^2 \big) = O\big( m^{-\frac{s}{2(4s+n)}} \log m \big).$$
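One simple way to approximate the flexible-variance scheme numerically (a sketch under the assumption that the continuum $\sigma \in (0,\infty)$ is replaced by a finite grid): solve the fixed-$\sigma$ problem for each grid value and keep the $\sigma$ with the smallest regularized empirical objective. The grid and data below are illustrative.

```python
import numpy as np

def gaussian_gram(X, sigma):
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def fit_flexible_gaussian(X, y, lam, sigma_grid):
    """Approximate f_{z,lambda}: for each sigma on a grid, solve the fixed-sigma
    regularized least-squares problem, then keep the sigma minimizing
    (1/m) sum (f(x_i) - y_i)^2 + lam * ||f||_{K_sigma}^2, with ||f||^2 = alpha^T K alpha."""
    m = len(y)
    best = None
    for sigma in sigma_grid:
        K = gaussian_gram(X, sigma)
        alpha = np.linalg.solve(K + lam * m * np.eye(m), y)
        fx = K @ alpha
        objective = np.mean((fx - y) ** 2) + lam * alpha @ K @ alpha
        if best is None or objective < best[0]:
            best = (objective, sigma, alpha)
    return best  # (objective value, chosen sigma, coefficients)

# illustrative use
rng = np.random.default_rng(1)
X = rng.random((150, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(150)
obj, sigma_star, alpha = fit_flexible_gaussian(X, y, lam=1e-3,
                                               sigma_grid=np.geomspace(0.02, 2.0, 30))
```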

Major difficulty: is the function set

$$\mathcal{H} = \bigcup_{0 < \sigma < \infty} \{ f \in H_{K_\sigma} : \|f\|_{K_\sigma} \leq R \}$$

with $R > 0$ learnable? That is, is this function set a uniform Glivenko-Cantelli class? Its closure is not a compact subset of $C(X)$.

Theory of uniform convergence for $\sup_{f \in \mathcal{H}} |\mathcal{E}_z(f) - \mathcal{E}(f)|$. Given a bounded set $\mathcal{H}$ of functions on $X$, when do we have

$$\lim_{l \to \infty} \, \sup_{\rho} \, \mathrm{Prob} \Big\{ \sup_{m \geq l} \, \sup_{f \in \mathcal{H}} \Big| \frac{1}{m} \sum_{i=1}^m f(x_i) - \int_X f(x) \, d\rho_X \Big| > \epsilon \Big\} = 0, \qquad \forall \epsilon > 0?$$

Such a set is called a uniform Glivenko-Cantelli (UGC) class.

Characterizations: Vapnik-Chervonenkis, and Alon, Ben-David, Cesa-Bianchi, Haussler (1997).

Our quantitative estimates: If $V : Y \times \mathbb{R} \to \mathbb{R}_+$ is convex with respect to the second variable, $M = \|V(y, 0)\|_{L^\infty_\rho(Z)} < \infty$, and $C_R = \sup\{\max\{|V'_-(y, t)|, |V'_+(y, t)|\} : y \in Y, |t| \leq R\} < \infty$, then we have

$$E_{z \in Z^m} \Big\{ \sup_{f \in \mathcal{H}} \Big| \frac{1}{m} \sum_{i=1}^m V(y_i, f(x_i)) - \int_Z V(y, f(x)) \, d\rho \Big| \Big\} \leq C \, C_R R \, \frac{\log m}{m^{1/4}} + \frac{2M}{\sqrt{m}},$$

where $C$ is a constant depending on $n$.

Ideas: reducing the estimates for $\mathcal{H}$ to a much smaller subset $\mathcal{F} = \{(K_\sigma)_x : x \in X, \, 0 < \sigma < \infty\}$, then bounding empirical covering numbers. The UGC property follows from the characterization of Dudley-Giné-Zinn.
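To give a concrete feel for the reduction to the small subset $\mathcal{F}$, here is a toy computation (not from the paper): discretize $\mathcal{F} = \{(K_\sigma)_x\}$ on a finite grid of centers $x$ and variances $\sigma$, and estimate its empirical $\ell^\infty$ covering number on a random sample by a greedy net. All grid sizes and the sample are illustrative assumptions.

```python
import numpy as np

def empirical_covering_number(F_vals, eps):
    """Greedy estimate of the empirical l^infty covering number of a finite
    function dictionary F_vals (each row = one function evaluated on the sample)."""
    remaining = list(range(len(F_vals)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [j for j in remaining
                     if np.max(np.abs(F_vals[j] - F_vals[c])) > eps]
    return len(centers)

# sample x_1, ..., x_m in [0, 1]^2 and discretize the dictionary
# F = {(K_sigma)_t : t in X, sigma > 0} on a grid of (t, sigma)
rng = np.random.default_rng(0)
m = 200
X_sample = rng.random((m, 2))
ts = rng.random((100, 2))             # candidate centers t
sigmas = np.geomspace(0.05, 5.0, 20)  # candidate variances sigma
F_vals = np.array([np.exp(-np.sum((X_sample - t) ** 2, axis=1) / (2 * s**2))
                   for t in ts for s in sigmas])

for eps in (0.5, 0.2, 0.1):
    print(eps, empirical_covering_number(F_vals, eps))
```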

Improve the learning rates when $X$ is a manifold of dimension $d$ with $d$ much smaller than the dimension $n$ of the underlying Euclidean space.

Approximation by Gaussians on Riemannian manifolds

Let $X$ be a $d$-dimensional connected compact $C^\infty$ submanifold of $\mathbb{R}^n$ without boundary. The approximation scheme is given by a family of linear operators $\{I_\sigma : C(X) \to C(X)\}_{\sigma > 0}$ as

$$I_\sigma(f)(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X K_\sigma(x, y) f(y) \, dV(y) = \frac{1}{(\sqrt{2\pi}\,\sigma)^d} \int_X \exp\Big\{ -\frac{|x - y|^2}{2\sigma^2} \Big\} f(y) \, dV(y), \quad x \in X,$$

where $V$ is the Riemannian volume measure of $X$.
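A small numerical sketch (not from the talk) of $I_\sigma$ in the simplest case, the unit circle $X = S^1 \subset \mathbb{R}^2$ with $d = 1$: the integral against the Riemannian volume (arc length) is replaced by a Riemann sum, and the output approaches $f$ uniformly as $\sigma \to 0$, at a rate quantified in Theorem 3 below. The test function and grid sizes are illustrative assumptions.

```python
import numpy as np

def I_sigma(f, sigma, n_grid=2000):
    """Approximate I_sigma(f)(x) on the unit circle S^1 in R^2 (d = 1):
    I_sigma(f)(x) = (sqrt(2 pi) sigma)^(-1) * int_X exp(-|x - y|^2 / (2 sigma^2)) f(y) dV(y),
    with dV the arc-length measure, discretized by a Riemann sum over the angle."""
    theta = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
    pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # points y on S^1
    dV = 2 * np.pi / n_grid                                  # arc-length element
    fy = f(theta)

    def approx(x_theta):
        x = np.array([np.cos(x_theta), np.sin(x_theta)])
        w = np.exp(-np.sum((pts - x) ** 2, axis=1) / (2 * sigma**2))
        return (w @ fy) * dV / (np.sqrt(2 * np.pi) * sigma)
    return approx

# the uniform error ||I_sigma(f) - f||_{C(X)} shrinks as sigma -> 0,
# at least as fast as sigma^s for f in Lip(s) (Theorem 3 below)
f = np.cos  # a Lipschitz (s = 1) test function of the angle
for sigma in (0.5, 0.25, 0.125):
    g = I_sigma(f, sigma)
    thetas = np.linspace(0, 2 * np.pi, 200)
    err = max(abs(g(t) - f(t)) for t in thetas)
    print(sigma, err)
```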

Theorem 3 (Ye-Zhou, Adv. Comput. Math. 2007). If $f_\rho \in \mathrm{Lip}(s)$ with $0 < s \leq 1$, then

$$\|I_\sigma(f_\rho) - f_\rho\|_{C(X)} \leq C_X \|f_\rho\|_{\mathrm{Lip}(s)} \, \sigma^s \qquad \forall \sigma > 0, \qquad (2)$$

where $C_X$ is a positive constant independent of $f_\rho$ or $\sigma$. By taking $\lambda = \big( \frac{\log m}{m^2} \big)^{\frac{s+d}{8s+4d}}$, we have

$$E_{z \in Z^m} \big\{ \|f_{z,\lambda} - f_\rho\|_{L^2_{\rho_X}}^2 \big\} = O\Big( \Big( \frac{\log m}{m^2} \Big)^{\frac{s}{8s+4d}} \Big).$$

When the manifold dimension $d$ is much smaller than $n$, the index $\frac{s}{8s+4d}$ in Theorem 3 improves on the index $\frac{s}{2(4s+n)}$ in Theorem 2.

Classification by Gaussians on Riemannian manifolds

Let $\phi(t) = \max\{1 - t, 0\}$ be the hinge loss for support vector machine classification. Define

$$f_{z,\lambda} = \arg\min_{\sigma \in (0, +\infty)} \ \min_{f \in H_{K_\sigma}} \ \frac{1}{m} \sum_{i=1}^m \phi\big(y_i f(x_i)\big) + \lambda \|f\|_{K_\sigma}^2.$$

By using $I_\sigma : L^p(X) \to L^p(X)$, we obtain learning rates for binary classification to learn the Bayes rule

$$f_c(x) = \begin{cases} 1, & \text{if } \rho(y = 1 \mid x) \geq \rho(y = -1 \mid x), \\ -1, & \text{if } \rho(y = 1 \mid x) < \rho(y = -1 \mid x). \end{cases}$$

Here $Y = \{1, -1\}$ represents the two classes. The misclassification error is defined as $\mathcal{R}(f) := \mathrm{Prob}\{y \neq f(x)\}$, and $\mathcal{R}(f) \geq \mathcal{R}(f_c)$ for any $f : X \to Y$.
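A rough sketch of how this classification scheme can be run numerically: for each $\sigma$ on a finite grid, minimize the regularized empirical hinge loss over functions of the representer form $f = \sum_j \alpha_j K_\sigma(x_j, \cdot)$ by subgradient descent, then keep the best $\sigma$. The representer form, the grid, the step sizes, and the synthetic data are illustrative assumptions, not the algorithm analyzed in the talk.

```python
import numpy as np

def gaussian_gram(X, sigma):
    d2 = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def fit_hinge_fixed_sigma(K, y, lam, n_iter=500, step=0.5):
    """Subgradient descent on J(alpha) = (1/m) sum phi(y_i (K alpha)_i) + lam alpha^T K alpha,
    phi(t) = max(1 - t, 0), for f = sum_j alpha_j K_sigma(x_j, .)."""
    m = len(y)
    alpha = np.zeros(m)
    for t in range(1, n_iter + 1):
        margins = y * (K @ alpha)
        active = (margins < 1).astype(float)            # subgradient of the hinge loss
        grad = -(K @ (active * y)) / m + 2 * lam * (K @ alpha)
        alpha -= (step / np.sqrt(t)) * grad
    obj = np.mean(np.maximum(1 - y * (K @ alpha), 0)) + lam * alpha @ K @ alpha
    return obj, alpha

def fit_flexible_gaussian_svm(X, y, lam, sigma_grid):
    best = None
    for sigma in sigma_grid:
        K = gaussian_gram(X, sigma)
        obj, alpha = fit_hinge_fixed_sigma(K, y, lam)
        if best is None or obj < best[0]:
            best = (obj, sigma, alpha)
    return best

# illustrative use: labels y_i in {1, -1}, classify with sgn(f_{z,lambda})
rng = np.random.default_rng(2)
X = rng.standard_normal((120, 2))
y = np.sign(X[:, 0] + 0.3 * rng.standard_normal(120))
obj, sigma_star, alpha = fit_flexible_gaussian_svm(X, y, lam=1e-2,
                                                   sigma_grid=np.geomspace(0.2, 5.0, 10))
```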

The Sobolev space $H^k_p(X)$ is the completion of $C^\infty(X)$ with respect to the norm

$$\|f\|_{H^k_p(X)} = \sum_{j=0}^k \Big( \int_X |\nabla^j f|^p \, dV \Big)^{1/p},$$

where $\nabla^j f$ denotes the $j$th covariant derivative of $f$.

Theorem 4. If $f_c$ lies in the interpolation space $(L^1(X), H^1_2(X))_\theta$ for some $0 < \theta \leq 1$, then by taking $\lambda = \big( \frac{\log m}{m^2} \big)^{\frac{2\theta+d}{12\theta+2d}}$, we have

$$E_{z \in Z^m} \big\{ \mathcal{R}(\mathrm{sgn}(f_{z,\lambda})) - \mathcal{R}(f_c) \big\} \leq C \Big( \frac{\log m}{m^2} \Big)^{\frac{\theta}{6\theta+d}},$$

where $C$ is a constant independent of $m$.

Ongoing topics: variable selection, dimensionality reduction, graph Laplacian, diffusion map.