Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Olivier Catoni
CREST, CNRS UMR 9194, Université Paris Saclay
olivier.catoni@ensae.fr

Ilaria Giulini
Laboratoire de Probabilités et Modèles Aléatoires, Université Paris Diderot
giulini@math.univ-paris-diderot.fr

Abstract

In this paper, we present a new estimator of the mean of a random vector, computed by applying some threshold function to the norm. Non-asymptotic dimension-free almost sub-Gaussian bounds are proved under weak moment assumptions, using PAC-Bayesian inequalities.

1 Introduction

Estimating the mean of a random vector under weak tail assumptions has attracted a lot of attention recently. A number of properties have spurred the interest in these new results, where the empirical mean is replaced by a more robust estimator. One aspect is that it is possible to obtain an estimator with a sub-Gaussian tail under much weaker assumptions on the data, down to the mere existence of a finite covariance matrix. Another appealing feature is that it is possible to obtain dimension-free non-asymptotic bounds that remain valid in a separable Hilbert space. Some important references are Catoni [2012] in the one-dimensional case, and Minsker [2015] and Lugosi and Mendelson [2017] in the multidimensional case. Building on the breakthrough of Minsker [2015], which uses a multidimensional generalization of the median-of-means estimator, Joly et al. [2017] and Lugosi and Mendelson [2017] propose successive improvements of the median-of-means approach to get an estimator with a genuine sub-Gaussian dimension-free tail bound, while still requiring only the existence of the covariance matrix. In the meantime, the M-estimator approach of Catoni [2012] has also been generalized to multidimensional settings through the use of matrix inequalities in Minsker [2016] and Minsker and Wei [2017]. Here we follow a different route, based on a multidimensional extension of Catoni [2012] using PAC-Bayesian bounds.
Our new estimator is a simple modification of the empirical mean, in which a threshold is applied to the norm of the sample vectors. It is therefore straightforward to compute, and this is a strong point of our approach compared to others. Note also that we make here some compromise on the sharpness of the estimation error bound, in order to simplify the definition and the computation of the estimator. This compromise consists in the presence of second-order terms, while the first-order terms can be made as close as desired to a true sub-Gaussian bound with exact constants, as stated in Lugosi and Mendelson [2017, eq. (1.1)]. With a more involved estimator, a true sub-Gaussian bound without second-order terms is possible and will be described in a separate publication.

2 Thresholding the norm

Consider a random vector $X \in \mathbb{R}^d$ and a sample $(X_1, \dots, X_n)$ made of independent copies of $X$. The question is to estimate $\mathbb{E}(X)$ from the sample, under the assumption that $\mathbb{E}(\|X\|^p) < \infty$ for some $p \geq 2$.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Consider the threshold function $\psi(t) = \min\{t, 1\}$, $t \in \mathbb{R}_+$, and, for some positive real parameter $\lambda$ to be chosen later, introduce the thresholded sample
\[ Y_i = \frac{\psi(\lambda \|X_i\|)}{\lambda \|X_i\|} \, X_i. \]
Our estimator of $m = \mathbb{E}(X)$ will simply be the thresholded empirical mean
\[ \widehat{m} = \frac{1}{n} \sum_{i=1}^{n} Y_i. \]

Proposition 2.1 Introduce the increasing functions
\[ g_1(t) = \frac{\exp(t) - 1}{t} \quad \text{and} \quad g_2(t) = \frac{2 \bigl( \exp(t) - t - 1 \bigr)}{t^2}, \qquad t \in \mathbb{R}, \]
that are defined by continuity at $t = 0$ and are such that $g_1(0) = g_2(0) = 1$. Assume that $\mathbb{E}(\|X\|^2) < \infty$ and that we know $v$ such that
\[ \sup_{\theta \in S_d} \mathbb{E}\bigl( \langle \theta, X - m \rangle^2 \bigr) \leq v < \infty, \]
where $S_d = \{ \theta \in \mathbb{R}^d : \|\theta\| = 1 \}$ is the unit sphere of $\mathbb{R}^d$. For some positive real parameter $\mu$, put
\[ \lambda = \frac{1}{\mu} \sqrt{\frac{2 \log(\delta^{-1})}{a v n}}, \qquad T = \max\bigl\{ \mathbb{E}(\|X - m\|^2), v \bigr\}^{1/2}, \qquad a = g_2(2\mu) \geq 1, \]
and let $b$ be such that
\[ b \geq \exp(2\mu) \, g_1\biggl( \frac{\mu^2}{T} \sqrt{\frac{2 a v}{b \log(\delta^{-1})}} \biggr). \]
With probability at least $1 - \delta$,
\[ \| \widehat{m} - m \| \leq \sqrt{\frac{2 a v \log(\delta^{-1})}{n}} + \frac{b T}{\sqrt{n}} + \inf_{p \geq 1} C_p \, n^{-p/2} + \inf_{p \geq 1} C'_p \, n^{-p/2}, \]
where
\[ C_p = \frac{p^p}{(p+1)^{p+1}} \biggl( \frac{2 \log(\delta^{-1})}{\mu^2 a v} \biggr)^{p/2} \biggl( 1 + \sqrt{\frac{2 a \log(\delta^{-1})}{n v}} \, \|m\| \biggr) \sup_{\theta \in S_d} \mathbb{E}\bigl( \|X\|^p \, | \langle \theta, X - m \rangle | \bigr), \quad \text{and} \]
\[ C'_p = \frac{p^p}{(p+1)^{p+1}} \biggl( \frac{2 \log(\delta^{-1})}{\mu^2 a v} \biggr)^{p/2} \mathbb{E}\bigl( \|X\|^p \bigr) \, \|m\| \biggl( 1 + \sqrt{\frac{2 a \log(\delta^{-1})}{n v}} \, \|m\| \biggr). \]

Remarks

1. Note that in case $\mathbb{E}(\|X\|^2) < \infty$ but $\mathbb{E}(\|X\|^p) = \infty$ for $p > 2$, we can use the bound
\[ C_1 n^{-1/2} + C'_2 n^{-1} \leq \frac{1}{4} \sqrt{\frac{2 \log(\delta^{-1})}{\mu^2 a n}} \, (T + \|m\|) \biggl( 1 + \sqrt{\frac{2 a \log(\delta^{-1})}{n v}} \, \|m\| \biggr) + \frac{8 \log(\delta^{-1})}{27 \mu^2 a v n} \, \mathbb{E}(\|X\|^2) \, \|m\| \biggl( 1 + \sqrt{\frac{2 a \log(\delta^{-1})}{n v}} \, \|m\| \biggr) = O\biggl( \sqrt{\frac{2 \log(\delta^{-1})}{\mu^2 a n}} \, (T + \|m\|) \biggl( 1 + \sqrt{\frac{2 a \log(\delta^{-1})}{n v}} \, \|m\| \biggr) \biggr). \]
Note also that if we take $\mu = 1/4$ and assume that $\delta \leq \exp(-2)$, then $a \leq 1.2$ and $b \leq 4$.

2. If moreover $\mathbb{E}(\|X\|^{p+1}) < \infty$ for some $p > 1$, we obtain, with probability at least $1 - \delta$,
\[ \| \widehat{m} - m \| \leq \sqrt{\frac{2.4 \, v \log(\delta^{-1})}{n}} + \frac{4 T}{\sqrt{n}} + C_p \, n^{-p/2} + C'_{p+1} \, n^{-(p+1)/2}, \]
meaning that the tail distribution of $\| \widehat{m} - m \|$ has a sub-Gaussian behavior, up to second-order terms. Remark that by taking $\mu$ small, we can make $a$ and $b$ as close as desired to $1$, at the expense of the values of $C_p$ and $C'_p$.
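In code, the estimator is a one-line modification of the empirical mean. The following sketch is ours, not a reference implementation: the function name and interface are invented, and the plugged-in choice of $\lambda$ (of order $\sqrt{\log(\delta^{-1})/(vn)}$, with $a = g_2(2\mu)$) follows our reading of the proposition, so the exact constant may differ.

```python
import numpy as np

def thresholded_mean(X, v, delta, mu=0.25):
    """Sketch of the thresholded empirical mean (hypothetical interface).

    X     : (n, d) array of i.i.d. observations.
    v     : known upper bound on the directional variances
            sup_theta E(<theta, X - m>^2).
    delta : the deviation bound holds with probability >= 1 - delta.
    mu    : free tuning parameter of Proposition 2.1.
    """
    n = X.shape[0]
    two_mu = 2.0 * mu
    a = 2.0 * (np.exp(two_mu) - two_mu - 1.0) / two_mu**2   # a = g_2(2 mu)
    lam = np.sqrt(2.0 * np.log(1.0 / delta) / (a * v * n)) / mu
    norms = np.linalg.norm(X, axis=1)
    # alpha_i = psi(lam * ||X_i||) / (lam * ||X_i||) with psi(t) = min(t, 1):
    # points inside the ball of radius 1/lam are kept unchanged, the
    # others are projected back onto that ball.
    alpha = np.ones(n)
    far = lam * norms > 1.0
    alpha[far] = 1.0 / (lam * norms[far])
    return (alpha[:, None] * X).mean(axis=0)
```

Since $\lambda$ shrinks as $n$ grows, fewer and fewer observations are affected by the threshold, and on light-tailed data the estimator is essentially the empirical mean.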
3 Proof

The rest of the paper is devoted to the proof of Proposition 2.1. An elementary computation shows that the threshold function $\psi$ satisfies
\[ 0 \leq 1 - \frac{\psi(t)}{t} \leq \inf_{p > 0} \frac{p^p}{(p+1)^{p+1}} \, t^p, \qquad t \in \mathbb{R}_+, \tag{1} \]
where non-integer values of the exponent $p$ are allowed. Let $Y = \frac{\psi(\lambda \|X\|)}{\lambda \|X\|} X$ and $\overline{m} = \mathbb{E}(Y)$. We can decompose the estimation error in direction $\theta$ into
\[ \langle \theta, \widehat{m} - m \rangle = \langle \theta, \overline{m} - m \rangle + \frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \overline{m} \rangle, \qquad \theta \in \mathbb{R}^d. \tag{2} \]
Introduce $\alpha = \frac{\psi(\lambda \|X\|)}{\lambda \|X\|}$ and let us deal with the first term first. As $0 \leq 1 - \alpha \leq \inf_{p > 0} \lambda^p \|X\|^p \frac{p^p}{(p+1)^{p+1}}$,
\[ \langle \theta, \overline{m} - m \rangle = \mathbb{E}\bigl[ (\alpha - 1) \langle \theta, X \rangle \bigr] = \mathbb{E}\bigl[ (\alpha - 1) \langle \theta, X - m \rangle \bigr] + \mathbb{E}(\alpha - 1) \langle \theta, m \rangle \leq \inf_{p \geq 1} \lambda^p \frac{p^p}{(p+1)^{p+1}} \mathbb{E}\bigl( \|X\|^p | \langle \theta, X - m \rangle | \bigr) + \inf_{p \geq 1} \lambda^p \frac{p^p}{(p+1)^{p+1}} \mathbb{E}\bigl( \|X\|^p \bigr) \langle \theta, m \rangle_-, \]
where $r_- = \max\{0, -r\}$ is the negative part of the real number $r$.

Let us now look at the second term of the decomposition (2). To gain uniformity in $\theta$, we will use a PAC-Bayesian inequality and the family of normal distributions $\rho_\theta = \mathcal{N}(\theta, \beta^{-1} I_d)$, indexed by the parameter $\theta \in \mathbb{R}^d$, where $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix of size $d \times d$, and where $\beta$ is a positive parameter to be chosen later on. We will use the following PAC-Bayesian inequality without recalling its proof, since it is a simple consequence of Catoni [2004, eq. (5.2.1) page 159]:

Lemma 2.2 For any bounded measurable function $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, any probability measure $\pi \in \mathcal{M}^1_+(\mathbb{R}^d)$ and any $\delta \in \, ]0, 1[$, with probability at least $1 - \delta$, for any probability measure $\rho \in \mathcal{M}^1_+(\mathbb{R}^d)$,
\[ \frac{1}{n} \sum_{i=1}^{n} \int f(\theta, X_i) \, \mathrm{d}\rho(\theta) \leq \int \log \mathbb{E}\bigl[ \exp\bigl( f(\theta, X) \bigr) \bigr] \, \mathrm{d}\rho(\theta) + \frac{\mathcal{K}(\rho, \pi) + \log(\delta^{-1})}{n}, \]
where $\mathcal{K}$ is the Kullback–Leibler divergence
\[ \mathcal{K}(\rho, \pi) = \begin{cases} \int \log\bigl( \mathrm{d}\rho / \mathrm{d}\pi \bigr) \, \mathrm{d}\rho, & \text{when } \rho \ll \pi, \\ +\infty, & \text{otherwise.} \end{cases} \]

Remarking that $\frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \overline{m} \rangle = \int \frac{1}{n} \sum_{i=1}^{n} \langle \theta', Y_i - \overline{m} \rangle \, \mathrm{d}\rho_\theta(\theta')$, using $\pi = \rho_0$, and taking into account the fact that $\mathcal{K}(\rho_\theta, \rho_0) = \beta \|\theta\|^2 / 2$, we obtain as a consequence of the previous lemma that, with probability at least $1 - \delta$, for any $\theta \in S_d$,
\[ \frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \overline{m} \rangle \leq \frac{1}{\mu \lambda} \int \log \mathbb{E}\Bigl[ \exp\bigl( \mu \lambda \langle \theta', Y - \overline{m} \rangle \bigr) \Bigr] \, \mathrm{d}\rho_\theta(\theta') + \frac{\beta/2 + \log(\delta^{-1})}{n \mu \lambda}. \]
In our setting $f$ is not bounded in $\theta$, but the required extension is valid, as explained in Catoni [2004].
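Since $\psi(t) = \min\{t, 1\}$, the left-hand side of (1) equals $\max\{0, 1 - 1/t\}$, and a direct computation shows that the supremum of $(1 - 1/t)/t^p$ is attained at $t = (p+1)/p$, with value exactly $p^p/(p+1)^{p+1}$. A quick numerical check of this bound (the grid and the exponents are arbitrary):

```python
import numpy as np

# With psi(t) = min(t, 1), the left-hand side of (1) is max(0, 1 - 1/t).
t = np.linspace(1e-6, 50.0, 100_000)
lhs = np.maximum(0.0, 1.0 - 1.0 / t)

for p in [0.5, 1.0, 1.7, 2.0, 3.0]:       # non-integer exponents are allowed
    c_p = p**p / (p + 1) ** (p + 1)       # constant appearing in (1)
    assert np.all(lhs <= c_p * t**p + 1e-12)

# The constant is tight: equality holds at t = (p + 1) / p.
p = 2.0
t_star = (p + 1) / p
assert abs((1 - 1 / t_star) / t_star**p - p**p / (p + 1) ** (p + 1)) < 1e-15
```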
Since the logarithm is concave,
\[ \int \log \mathbb{E}\Bigl[ \exp\bigl( \mu \lambda \langle \theta', Y - \overline{m} \rangle \bigr) \Bigr] \, \mathrm{d}\rho_\theta(\theta') \leq \log \mathbb{E} \int \exp\bigl( \mu \lambda \langle \theta', Y - \overline{m} \rangle \bigr) \, \mathrm{d}\rho_\theta(\theta') = \log \mathbb{E} \exp\biggl( \mu \lambda \langle \theta, Y - \overline{m} \rangle + \frac{\mu^2 \lambda^2 \|Y - \overline{m}\|^2}{2 \beta} \biggr), \]
where we have used the explicit expression of the Laplace transform of a Gaussian distribution.
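The Gaussian Laplace transform invoked here, $\int \exp(\langle u, \theta' \rangle) \, \mathrm{d}\rho_\theta(\theta') = \exp\bigl( \langle u, \theta \rangle + \|u\|^2/(2\beta) \bigr)$ for $\rho_\theta$ a normal distribution centered at $\theta$ with covariance $\beta^{-1} I_d$, can be sanity-checked by Monte Carlo (all numerical values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 4.0                        # rho_theta = N(theta, beta^{-1} I_d)
theta = np.array([0.6, 0.8])
u = np.array([0.3, -0.2])

# Closed form: exp(<u, theta> + ||u||^2 / (2 beta)).
closed = np.exp(u @ theta + u @ u / (2 * beta))

# Monte Carlo estimate of E exp(<u, theta'>) for theta' ~ N(theta, beta^{-1} I).
samples = theta + rng.standard_normal((200_000, 2)) / np.sqrt(beta)
mc = np.exp(samples @ u).mean()

assert abs(mc - closed) < 5e-3
```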
To go further, taking as a source of inspiration the proof of Bennett's inequality, let us introduce the increasing functions $g_1$ and $g_2$ defined in Proposition 2.1. These functions will be used to bound the exponential function by polynomials. More precisely, we will exploit the fact that, when $t \leq b$, $\exp(t) \leq 1 + t + g_2(b) t^2 / 2$, and that, when $0 \leq u \leq c$, $\exp(u) \leq 1 + g_1(c) u$. From this, it results that if $t \leq b$ and $0 \leq u \leq c$,
\[ \exp(t + u) \leq \exp(t) \bigl( 1 + g_1(c) u \bigr) \leq \exp(t) + g_1(c) \exp(b) u \leq 1 + t + \frac{g_2(b) t^2}{2} + g_1(c) \exp(b) u. \]
Legitimate values for $b$ and $c$ will be deduced from the remark that $\lambda \|Y\| \leq 1$, implying $\lambda \|\overline{m}\| \leq 1$. Namely, in our context, we will use $b = 2\mu$ and $c = 2\mu^2/\beta$. These arguments put together lead, since $\mathbb{E}\langle \theta, Y - \overline{m} \rangle = 0$, to the inequality
\[ \mathbb{E} \exp\biggl( \mu \lambda \langle \theta, Y - \overline{m} \rangle + \frac{\mu^2 \lambda^2 \|Y - \overline{m}\|^2}{2 \beta} \biggr) \leq 1 + g_2(2\mu) \frac{\mu^2 \lambda^2}{2} \mathbb{E}\bigl( \langle \theta, Y - \overline{m} \rangle^2 \bigr) + \exp(2\mu) \, g_1\bigl( 2\mu^2/\beta \bigr) \frac{\mu^2 \lambda^2}{2 \beta} \mathbb{E}\bigl( \|Y - \overline{m}\|^2 \bigr). \]
Replacing in the previous inequalities, we obtain

Lemma 2.3 With probability at least $1 - \delta$, for any $\theta \in S_d$,
\[ \langle \theta, \widehat{m} - \overline{m} \rangle = \frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \overline{m} \rangle \leq g_2(2\mu) \frac{\mu \lambda}{2} \mathbb{E}\bigl( \langle \theta, Y - \overline{m} \rangle^2 \bigr) + \exp(2\mu) \, g_1\bigl( 2\mu^2/\beta \bigr) \frac{\mu \lambda}{2 \beta} \mathbb{E}\bigl( \|Y - \overline{m}\|^2 \bigr) + \frac{\beta/2 + \log(\delta^{-1})}{n \mu \lambda}. \]

Remark that
\[ \langle \theta, Y - m \rangle = \langle \theta, \alpha X - m \rangle = \alpha \langle \theta, X - m \rangle - (1 - \alpha) \langle \theta, m \rangle, \]
so that $| \langle \theta, Y - m \rangle | \leq \alpha | \langle \theta, X - m \rangle | + (1 - \alpha) | \langle \theta, m \rangle | \leq | \langle \theta, X - m \rangle | + (1 - \alpha) | \langle \theta, m \rangle |$. Therefore, using inequality (1) and the definition of $\alpha$,
\[ \mathbb{E}\bigl( \langle \theta, Y - \overline{m} \rangle^2 \bigr) \leq \mathbb{E}\bigl( \langle \theta, Y - m \rangle^2 \bigr) \leq \mathbb{E}\bigl( \langle \theta, X - m \rangle^2 \bigr) + 2 | \langle \theta, m \rangle | \, \mathbb{E}\bigl( (1 - \alpha) | \langle \theta, X - m \rangle | \bigr) + \langle \theta, m \rangle^2 \, \mathbb{E}(1 - \alpha), \]
where each expectation involving $1 - \alpha$ can be bounded through inequality (1) by $\inf_{p \geq 1} \lambda^p \frac{p^p}{(p+1)^{p+1}} \mathbb{E}(\|X\|^p \, \cdot \,)$. Remark also that $Y = g(X)$, where $g$ is a contraction (being the projection on a ball). Consequently, denoting by $(Y_1, Y_2)$ and $(X_1, X_2)$ pairs of independent copies of $Y$ and $X$,
\[ \mathbb{E}\bigl( \|Y - \overline{m}\|^2 \bigr) = \frac{1}{2} \mathbb{E}\bigl( \|Y_1 - Y_2\|^2 \bigr) \leq \frac{1}{2} \mathbb{E}\bigl( \|X_1 - X_2\|^2 \bigr) = \mathbb{E}\bigl( \|X - m\|^2 \bigr). \]
In view of these remarks, the previous lemma translates into

Lemma 2.4 Let $a = g_2(2\mu)$ and $b \geq \exp(2\mu) \, g_1\bigl( 2\mu^2/\beta \bigr)$. With probability at least $1 - \delta$, for any $\theta \in S_d$,
\[ \langle \theta, \widehat{m} - m \rangle \leq \frac{a \mu \lambda}{2} \mathbb{E}\bigl( \langle \theta, X - m \rangle^2 \bigr) + \frac{b \mu \lambda}{2 \beta} \mathbb{E}\bigl( \|X - m\|^2 \bigr) + \frac{\beta/2 + \log(\delta^{-1})}{n \mu \lambda} + \bigl( 1 + a \mu \lambda | \langle \theta, m \rangle | \bigr) \inf_{p \geq 1} \lambda^p \frac{p^p}{(p+1)^{p+1}} \mathbb{E}\bigl( \|X\|^p | \langle \theta, X - m \rangle | \bigr) + \inf_{p \geq 1} \lambda^p \frac{p^p}{(p+1)^{p+1}} \mathbb{E}\bigl( \|X\|^p \bigr) \Bigl( \langle \theta, m \rangle_- + \frac{a \mu \lambda}{2} \langle \theta, m \rangle^2 \Bigr). \]

Proposition 2.1 follows by taking $b$ as mentioned there,
\[ \lambda = \frac{1}{\mu} \sqrt{\frac{2 \log(\delta^{-1})}{a v n}} \qquad \text{and} \qquad \beta = T \sqrt{\frac{2 b \log(\delta^{-1})}{a v}}, \]
so that the condition on $b$ is satisfied.
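The two polynomial bounds on the exponential are easy to verify on a grid; a minimal sketch with arbitrary values of $b$ and $c$ ($g_1$ and $g_2$ as in Proposition 2.1):

```python
import numpy as np

def g1(t):
    """g_1(t) = (exp(t) - 1) / t, extended by continuity with g_1(0) = 1."""
    t = np.asarray(t, dtype=float)
    return np.where(t == 0, 1.0, np.expm1(t) / np.where(t == 0, 1.0, t))

def g2(t):
    """g_2(t) = 2 (exp(t) - t - 1) / t^2, extended with g_2(0) = 1."""
    t = np.asarray(t, dtype=float)
    return np.where(t == 0, 1.0, 2 * (np.expm1(t) - t) / np.where(t == 0, 1.0, t) ** 2)

b, c = 0.5, 0.125                  # e.g. b = 2*mu and c = 2*mu**2/beta, mu = 1/4, beta = 1
t = np.linspace(-10.0, b, 10001)   # the quadratic bound holds for every t <= b
u = np.linspace(0.0, c, 10001)     # the linear bound holds for 0 <= u <= c

assert np.all(np.exp(t) <= 1 + t + g2(b) * t**2 / 2 + 1e-12)
assert np.all(np.exp(u) <= 1 + g1(c) * u + 1e-12)
assert abs(g1(0.0) - 1.0) < 1e-12 and abs(g2(0.0) - 1.0) < 1e-12
```

Both inequalities hold with equality at the right endpoint of the grid, which is why the functions $g_1$ and $g_2$ are the natural Bennett-type constants here.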
References

O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXXI, 2001, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.

O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148–1185, 2012.

E. Joly, G. Lugosi, and R. I. Oliveira. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11:440–451, 2017.

G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector. Annals of Statistics, to appear, 2017.

S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308–2335, 2015.

S. Minsker. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Annals of Statistics, to appear, 2016.

S. Minsker and X. Wei. Estimation of the covariance structure of heavy-tailed distributions. In NIPS 2017, to appear, 2017.