Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector

Olivier Catoni
CREST, CNRS UMR 9194, Université Paris Saclay
olivier.catoni@ensae.fr

Ilaria Giulini
Laboratoire de Probabilités et Modèles Aléatoires, Université Paris Diderot
giulini@math.univ-paris-diderot.fr

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

In this paper, we present a new estimator of the mean of a random vector, computed by applying some threshold function to the norm. Non-asymptotic dimension-free almost sub-Gaussian bounds are proved under weak moment assumptions, using PAC-Bayesian inequalities.

1 Introduction

Estimating the mean of a random vector under weak tail assumptions has attracted a lot of attention recently. A number of properties have spurred the interest for these new results, where the empirical mean is replaced by a more robust estimator. One aspect is that it is possible to obtain an estimator with a sub-Gaussian tail under much weaker assumptions on the data, up to the point of assuming only the existence of a finite covariance matrix. Another appealing feature is that it is possible to obtain dimension-free non-asymptotic bounds that remain valid in a separable Hilbert space. Some important references are Catoni [2012] in the one-dimensional case and Minsker [2015] and Lugosi and Mendelson [2017] in the multidimensional case.

Building on the breakthrough of Minsker [2015], which uses a multidimensional generalization of the median of means estimator, Joly et al. [2017] and Lugosi and Mendelson [2017] propose successive improvements of the median of means approach to get an estimator with a genuine sub-Gaussian dimension-free tail bound, while still requiring only the existence of the covariance matrix. In the meantime, the M-estimator approach of Catoni [2012] has also been generalized to multidimensional settings through the use of matrix inequalities in Minsker [2016] and Minsker and Wei [2017].

Here we follow a different route, based on a multidimensional extension of Catoni [2012] using PAC-Bayesian bounds. Our new estimator is a simple modification of the empirical mean, where some threshold is applied to the norm of the sample vectors. It is therefore straightforward to compute, and this is a strong point of our approach compared to others. Note also that we make here some compromise on the sharpness of the estimation error bound, in order to simplify the definition and computation of the estimator. This compromise consists in the presence of second order terms, while the first order terms can be made as close as desired to a true sub-Gaussian bound with exact constants, as stated in Lugosi and Mendelson [2017, eq. (1.1)]. With a more involved estimator, a true sub-Gaussian bound without second order terms is possible and will be described in a separate publication.

2 Thresholding the norm

Consider a random vector $X \in \mathbb{R}^d$ and a sample $(X_1, \dots, X_n)$ made of $n$ independent copies of $X$. The question is to estimate $\mathbb{E}(X)$ from the sample, under the assumption that $\mathbb{E}(\|X\|^p) < \infty$ for some $p \ge 2$.
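To fix ideas before the estimator is defined, here is a minimal simulation sketch of this setting: $n$ i.i.d. draws of a heavy-tailed random vector whose mean we want to recover. The multivariate Student-$t$ generator and all parameter values below are our own illustration, not part of the paper.

```python
import numpy as np

def heavy_tailed_sample(n, d, df=3.0, mean=None, seed=None):
    """Draw n i.i.d. copies of a multivariate Student-t vector: for df > 2 the
    covariance matrix is finite, but higher moments may be infinite."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(d) if mean is None else np.asarray(mean)
    z = rng.standard_normal((n, d))          # Gaussian directions
    w = rng.chisquare(df, size=(n, 1))       # radial mixing variable
    return mean + z / np.sqrt(w / df)

X = heavy_tailed_sample(n=10_000, d=50, df=3.0, mean=np.ones(50), seed=0)
print(np.linalg.norm(X.mean(axis=0) - np.ones(50)))  # error of the plain empirical mean
```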

Consider the threshold function $\psi(t) = \min\{t, 1\}$, $t \in \mathbb{R}_+$, and for some positive real parameter $\lambda$, to be chosen later, introduce the thresholded sample

$$Y_i = \frac{\psi(\lambda \|X_i\|)}{\lambda \|X_i\|}\, X_i.$$

Our estimator of $m = \mathbb{E}(X)$ will simply be the thresholded empirical mean

$$\hat{m} = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$

Proposition 2.1 Introduce the increasing functions

$$g_1(t) = \frac{\exp(t) - 1}{t} \qquad \text{and} \qquad g_2(t) = \frac{2\bigl(\exp(t) - t - 1\bigr)}{t^2}, \qquad t \in \mathbb{R},$$

that are defined by continuity at $t = 0$ and are such that $g_1(0) = g_2(0) = 1$. Assume that $\mathbb{E}(\|X\|^2) < \infty$ and that we know $v$ such that

$$\sup_{\theta \in S_d} \mathbb{E}\bigl(\langle \theta, X - m \rangle^2\bigr) \le v < \infty,$$

where $S_d = \{\theta \in \mathbb{R}^d : \|\theta\| = 1\}$ is the unit sphere of $\mathbb{R}^d$. For some positive real parameter $\mu$, put

$$\lambda = \frac{1}{\mu} \sqrt{\frac{2 \log(\delta^{-1})}{n a v}}, \qquad T = \max\bigl\{\mathbb{E}(\|X - m\|^2),\, 2v\bigr\},$$

$$a = g_2(2\mu), \qquad b = \exp(2\mu)\, g_1\biggl(\mu^2 \sqrt{\frac{2 a v}{T \log(\delta^{-1})}}\biggr).$$

With probability at least $1 - \delta$,

$$\|\hat{m} - m\| \le \sqrt{\frac{2 a v \log(\delta^{-1})}{n}} + \sqrt{\frac{b T}{n}} + \inf_{p \ge 1} \frac{C_p}{n^{p/2}} + \inf_{p \ge 1} \frac{C'_p}{n^{p/2}},$$

where

$$C_p = \frac{1}{p+1} \biggl(\frac{p}{p+1}\biggr)^{p} \biggl(\frac{2 \log(\delta^{-1})}{\mu^2 a v}\biggr)^{p/2} \sup_{\theta \in S_d} \mathbb{E}\bigl(\|X\|^p\, |\langle \theta, X - m \rangle|\bigr), \qquad \text{and}$$

$$C'_p = \frac{1}{p+1} \biggl(\frac{p}{p+1}\biggr)^{p} \biggl(\frac{2 \log(\delta^{-1})}{\mu^2 a v}\biggr)^{p/2} \mathbb{E}(\|X\|^p)\, \|m\| \biggl(1 + \sqrt{\frac{a \log(\delta^{-1})}{2 n v}}\, \|m\|\biggr).$$

Remarks

1. Note that in case $\mathbb{E}(\|X\|^2) < \infty$ but $\mathbb{E}(\|X\|^p) = \infty$ for $p > 2$, we can use the bound

$$\frac{C_1}{\sqrt{n}} + \frac{C'_2}{n} \le \sqrt{\frac{\log(\delta^{-1})\,(T + \|m\|^2)}{8 \mu^2 a n}} + \frac{8 \log(\delta^{-1})}{27 \mu^2 a v n}\, \mathbb{E}(\|X\|^2)\, \|m\| \biggl(1 + \sqrt{\frac{a \log(\delta^{-1})}{2 n v}}\, \|m\|\biggr) = O\Biggl(\sqrt{\frac{\log(\delta^{-1})\,(T + \|m\|^2)}{\mu^2 a n}}\Biggr).$$

2. Note also that if we take $\mu = 1/4$ and assume that $\delta \le \exp(-2)$, then $a \le 1.2$ and $b \le 4.2$. If moreover $\mathbb{E}(\|X\|^{p+1}) < \infty$ for some $p > 1$, we obtain with probability at least $1 - \delta$ that

$$\|\hat{m} - m\| \le \sqrt{\frac{2.4\, v \log(\delta^{-1})}{n}} + \sqrt{\frac{4.2\, T}{n}} + \frac{C_p}{n^{p/2}} + \frac{C'_{p+1}}{n^{(p+1)/2}},$$

meaning that the tail distribution of $\|\hat{m} - m\|$ has a sub-Gaussian behavior, up to second order terms. Remark that by taking $\mu$ small, we can make $a$ and $b$ as close as desired to 1, at the expense of the values of $C_p$ and $C'_p$.
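Since the paper stresses that $\hat{m}$ is straightforward to compute, here is a minimal sketch of the thresholded empirical mean with $\lambda$ set as in Proposition 2.1. It assumes the variance bound $v$ is known; the function names and default values are ours, not the paper's.

```python
import numpy as np

def g2(t):
    """g2(t) = 2(exp(t) - t - 1)/t^2, extended by continuity with g2(0) = 1."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1e-8, 1.0, 2.0 * (np.exp(t) - t - 1.0) / t**2)

def thresholded_mean(X, v, delta=0.01, mu=0.25):
    """Thresholded empirical mean of Proposition 2.1.

    X     : (n, d) array of i.i.d. sample vectors
    v     : known bound on the directional variances sup_theta E<theta, X - m>^2
    delta : confidence level; mu : free parameter (mu = 1/4 as in Remark 2)
    """
    n = X.shape[0]
    a = float(g2(2.0 * mu))
    lam = np.sqrt(2.0 * np.log(1.0 / delta) / (n * a * v)) / mu
    norms = np.linalg.norm(X, axis=1)
    # alpha_i = psi(lam ||X_i||) / (lam ||X_i||) = min(1, 1 / (lam ||X_i||))
    alpha = np.minimum(1.0, 1.0 / np.maximum(lam * norms, 1e-300))
    return (alpha[:, None] * X).mean(axis=0)
```

On the Student-$t$ sample generated above, `thresholded_mean(X, v=3.0)` can be compared with the plain `X.mean(axis=0)`; for df = 3 and identity scale matrix the directional variance equals df/(df - 2) = 3, so $v = 3$ is exact in that toy case.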

Proof. The rest of the paper is devoted to the proof of Proposition 2.1. An elementary computation shows that the threshold function $\psi$ satisfies

$$0 \le 1 - \frac{\psi(t)}{t} \le \inf_{p \ge 1} \frac{(p-1)^{p-1}}{p^p}\, t^{p-1}, \qquad t \in \mathbb{R}_+, \qquad (1)$$

where non-integer values of the exponent $p$ are allowed. Let $Y = \frac{\psi(\lambda \|X\|)}{\lambda \|X\|} X$ and $\bar{m} = \mathbb{E}(Y)$. We can decompose the estimation error in direction $\theta$ into

$$\langle \theta, \hat{m} - m \rangle = \langle \theta, \bar{m} - m \rangle + \frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \bar{m} \rangle, \qquad \theta \in \mathbb{R}^d. \qquad (2)$$

Introduce $\alpha = \frac{\psi(\lambda \|X\|)}{\lambda \|X\|}$ and let us deal with the first term first. As $0 \le 1 - \alpha \le \inf_{p \ge 1} \lambda^{p-1} \|X\|^{p-1} (p-1)^{p-1}/p^p$,

$$\langle \theta, \bar{m} - m \rangle = \mathbb{E}\bigl[(\alpha - 1)\langle \theta, X \rangle\bigr] = \mathbb{E}\bigl[(\alpha - 1)\langle \theta, X - m \rangle\bigr] + \mathbb{E}(\alpha - 1)\,\langle \theta, m \rangle$$
$$\le \inf_{p \ge 1} \lambda^{p-1} \frac{(p-1)^{p-1}}{p^p} \mathbb{E}\bigl(\|X\|^{p-1} |\langle \theta, X - m \rangle|\bigr) + \inf_{p \ge 1} \lambda^{p-1} \frac{(p-1)^{p-1}}{p^p} \mathbb{E}\bigl(\|X\|^{p-1}\bigr)\, \langle \theta, m \rangle_-,$$

where $r_- = \max\{0, -r\}$ is the negative part of the real number $r$.

Let us now look at the second term of the decomposition (2). To gain uniformity in $\theta$, we will use a PAC-Bayesian inequality and the family of normal distributions $\rho_\theta = \mathcal{N}(\theta, \beta^{-1} I_d)$, bearing on the parameter $\theta \in \mathbb{R}^d$, where $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix of size $d \times d$, and where $\beta$ is a positive parameter to be chosen later on. We will use the following PAC-Bayesian inequality without recalling its proof; it is a simple consequence of Catoni [2004, eq. (5.2.1) page 159]:

Lemma 2.2 For any bounded measurable function $f : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, for any probability measure $\pi \in \mathcal{M}_+^1(\mathbb{R}^d)$, for any $\delta \in\, ]0, 1[$, with probability at least $1 - \delta$, for any probability measure $\rho \in \mathcal{M}_+^1(\mathbb{R}^d)$,

$$\frac{1}{n} \sum_{i=1}^{n} \int f(\theta, X_i)\, \mathrm{d}\rho(\theta) \le \int \log \mathbb{E}\bigl[\exp\bigl(f(\theta, X)\bigr)\bigr]\, \mathrm{d}\rho(\theta) + \frac{\mathcal{K}(\rho, \pi) + \log(\delta^{-1})}{n},$$

where $\mathcal{K}$ is the Kullback–Leibler divergence

$$\mathcal{K}(\rho, \pi) = \begin{cases} \int \log\bigl(\mathrm{d}\rho/\mathrm{d}\pi\bigr)\, \mathrm{d}\rho, & \text{when } \rho \ll \pi, \\ +\infty, & \text{otherwise.} \end{cases}$$

Remarking that $\frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \bar{m} \rangle = \frac{1}{n} \sum_{i=1}^{n} \int \langle \theta', Y_i - \bar{m} \rangle\, \mathrm{d}\rho_\theta(\theta')$, using $\pi = \rho_0$, and taking into account the fact that $\mathcal{K}(\rho_\theta, \rho_0) = \beta \|\theta\|^2/2$, we obtain as a consequence of the previous lemma that with probability at least $1 - \delta$, for any $\theta \in S_d$,

$$\frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \bar{m} \rangle \le \frac{1}{\mu\lambda} \int \log \mathbb{E} \exp\bigl(\mu\lambda \langle \theta', Y - \bar{m} \rangle\bigr)\, \mathrm{d}\rho_\theta(\theta') + \frac{\beta/2 + \log(\delta^{-1})}{n \mu \lambda}.$$

In our setting $f$ is not bounded in $\theta$, but the required extension is valid, as explained in Catoni [2004]. Since the logarithm is concave,

$$\int \log \mathbb{E} \exp\bigl(\mu\lambda \langle \theta', Y - \bar{m} \rangle\bigr)\, \mathrm{d}\rho_\theta(\theta') \le \log \mathbb{E}\Bigl[\int \exp\bigl(\mu\lambda \langle \theta', Y - \bar{m} \rangle\bigr)\, \mathrm{d}\rho_\theta(\theta')\Bigr] = \log \mathbb{E} \exp\Bigl(\mu\lambda \langle \theta, Y - \bar{m} \rangle + \frac{\mu^2 \lambda^2 \|Y - \bar{m}\|^2}{2\beta}\Bigr),$$

where we have used the explicit expression of the Laplace transform of a Gaussian distribution.
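The last equality is the Gaussian Laplace transform $\int \exp(\langle u, \theta' \rangle)\, \mathrm{d}\rho_\theta(\theta') = \exp\bigl(\langle u, \theta \rangle + \|u\|^2/(2\beta)\bigr)$ applied with $u = \mu\lambda (Y - \bar{m})$. A quick Monte Carlo sanity check of this identity (our own illustration, with arbitrary values of $d$, $\beta$, $u$ and $\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 5, 2.0
theta = rng.standard_normal(d)   # center of rho_theta
u = rng.standard_normal(d)       # plays the role of mu * lambda * (Y - m_bar)

# Sample theta' ~ N(theta, beta^{-1} I_d) and average exp(<u, theta'>).
thetas = theta + rng.standard_normal((500_000, d)) / np.sqrt(beta)
mc = np.exp(thetas @ u).mean()
exact = np.exp(u @ theta + (u @ u) / (2.0 * beta))
print(mc, exact)   # the two values agree up to Monte Carlo error
```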

To go further, taking as a source of inspiration the proof of Bennett's inequality, let us use the increasing functions $g_1$ and $g_2$ defined in Proposition 2.1. These functions will serve to bound the exponential function by polynomials. More precisely, we will exploit the fact that $\exp(t) \le 1 + t + g_2(b)\, t^2/2$ when $t \le b$, and $\exp(u) \le 1 + g_1(c)\, u$ when $0 \le u \le c$. From this, it results that if $t \le b$ and $0 \le u \le c$,

$$\exp(t + u) \le \exp(t)\bigl(1 + g_1(c)\, u\bigr) \le \exp(t) + g_1(c) \exp(b)\, u \le 1 + t + g_2(b)\, t^2/2 + g_1(c) \exp(b)\, u.$$

Legitimate values for $b$ and $c$ will be deduced from the remark that $\lambda \|Y\| \le 1$, implying $\lambda \|\bar{m}\| \le 1$. Namely, in our context, we will use $b = 2\mu$ and $c = 2\mu^2/\beta$. These arguments put together lead to the inequality (using the fact that $\mathbb{E}\langle \theta, Y - \bar{m} \rangle = 0$)

$$\mathbb{E} \exp\Bigl(\mu\lambda \langle \theta, Y - \bar{m} \rangle + \frac{\mu^2\lambda^2 \|Y - \bar{m}\|^2}{2\beta}\Bigr) \le 1 + g_2(2\mu) \frac{\mu^2\lambda^2}{2} \mathbb{E}\bigl(\langle \theta, Y - \bar{m} \rangle^2\bigr) + \exp(2\mu)\, g_1\Bigl(\frac{2\mu^2}{\beta}\Bigr) \frac{\mu^2\lambda^2}{2\beta} \mathbb{E}\bigl(\|Y - \bar{m}\|^2\bigr).$$

Replacing in the previous inequalities, we obtain

Lemma 2.3 With probability at least $1 - \delta$, for any $\theta \in S_d$,

$$\langle \theta, \hat{m} - \bar{m} \rangle = \frac{1}{n} \sum_{i=1}^{n} \langle \theta, Y_i - \bar{m} \rangle \le g_2(2\mu) \frac{\mu\lambda}{2} \mathbb{E}\bigl(\langle \theta, Y - \bar{m} \rangle^2\bigr) + \exp(2\mu)\, g_1\Bigl(\frac{2\mu^2}{\beta}\Bigr) \frac{\mu\lambda}{2\beta} \mathbb{E}\bigl(\|Y - \bar{m}\|^2\bigr) + \frac{\beta/2 + \log(\delta^{-1})}{n\mu\lambda}.$$

Remark that, by convexity of the square function, since $0 \le \alpha \le 1$,

$$\langle \theta, Y - m \rangle^2 = \bigl(\alpha \langle \theta, X - m \rangle - (1 - \alpha) \langle \theta, m \rangle\bigr)^2 \le \alpha \langle \theta, X - m \rangle^2 + (1 - \alpha) \langle \theta, m \rangle^2 \le \langle \theta, X - m \rangle^2 + (1 - \alpha) \langle \theta, m \rangle^2.$$

Therefore, using inequality (1) and the definition of $\alpha$,

$$\mathbb{E}\bigl(\langle \theta, Y - \bar{m} \rangle^2\bigr) \le \mathbb{E}\bigl(\langle \theta, Y - m \rangle^2\bigr) \le \mathbb{E}\bigl(\langle \theta, X - m \rangle^2\bigr) + \langle \theta, m \rangle^2 \inf_{p \ge 1} \lambda^{p-1} \frac{(p-1)^{p-1}}{p^p} \mathbb{E}\bigl(\|X\|^{p-1}\bigr).$$

Remark also that $Y = g(X)$, where $g$ is a contraction (being the projection on a ball). Consequently, denoting by $(X_1, Y_1)$ and $(X_2, Y_2)$ two independent copies of $(X, Y)$,

$$\mathbb{E}\bigl(\|Y - \bar{m}\|^2\bigr) = \frac{1}{2} \mathbb{E}\bigl(\|Y_1 - Y_2\|^2\bigr) \le \frac{1}{2} \mathbb{E}\bigl(\|X_1 - X_2\|^2\bigr) = \mathbb{E}\bigl(\|X - m\|^2\bigr).$$

In view of these remarks, the previous lemma translates to

Lemma 2.4 Let $a = g_2(2\mu)$ and $b \ge \exp(2\mu)\, g_1(2\mu^2/\beta)$. With probability at least $1 - \delta$, for any $\theta \in S_d$,

$$\langle \theta, \hat{m} - m \rangle \le \frac{a\mu\lambda}{2} \mathbb{E}\bigl(\langle \theta, X - m \rangle^2\bigr) + \frac{b\mu\lambda}{2\beta} \mathbb{E}\bigl(\|X - m\|^2\bigr) + \frac{\beta/2 + \log(\delta^{-1})}{n\mu\lambda}$$
$$+ \inf_{p \ge 1} \lambda^{p-1} \frac{(p-1)^{p-1}}{p^p} \mathbb{E}\bigl(\|X\|^{p-1} |\langle \theta, X - m \rangle|\bigr) + \inf_{p \ge 1} \lambda^{p-1} \frac{(p-1)^{p-1}}{p^p} \mathbb{E}\bigl(\|X\|^{p-1}\bigr) \Bigl(\langle \theta, m \rangle_- + \frac{a\mu\lambda}{2} \langle \theta, m \rangle^2\Bigr).$$

Proposition 2.1 follows by taking $b$ as defined there,

$$\lambda = \frac{1}{\mu} \sqrt{\frac{2 \log(\delta^{-1})}{n a v}}, \qquad \text{and} \qquad \beta = \sqrt{\frac{2 b T \log(\delta^{-1})}{a v}},$$

so that the condition on $b$ is satisfied.
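Both polynomial bounds on the exponential used above are elementary but easy to misstate by a sign; the following small numeric check (our own illustration, with the values of $b$ and $c$ that correspond to $\mu = 1/4$ and $\beta = 1$) verifies them on a grid.

```python
import numpy as np

def g1(t):
    """g1(t) = (exp(t) - 1)/t, extended by continuity with g1(0) = 1."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1e-8, 1.0, (np.exp(t) - 1.0) / t)

def g2(t):
    """g2(t) = 2(exp(t) - t - 1)/t^2, extended by continuity with g2(0) = 1."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) < 1e-8, 1.0, 2.0 * (np.exp(t) - t - 1.0) / t**2)

b, c = 0.5, 0.125                  # b = 2*mu, c = 2*mu^2/beta with mu = 1/4, beta = 1
t = np.linspace(-10.0, b, 10_001)  # exp(t) <= 1 + t + g2(b) t^2/2 for all t <= b
u = np.linspace(0.0, c, 10_001)    # exp(u) <= 1 + g1(c) u for 0 <= u <= c

assert np.all(np.exp(t) <= 1.0 + t + g2(b) * t**2 / 2.0 + 1e-12)
assert np.all(np.exp(u) <= 1.0 + g1(c) * u + 1e-12)
print("Bennett-style polynomial bounds hold on the test grid.")
```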

References

O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour XXXI - 2001, volume 1851 of Lecture Notes in Mathematics, pages 1-269. Springer, 2004.

O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré, 48(4):1148-1185, 2012.

E. Joly, G. Lugosi, and R. I. Oliveira. On the estimation of the mean of a random vector. Electronic Journal of Statistics, 11:440-451, 2017.

G. Lugosi and S. Mendelson. Sub-Gaussian estimators of the mean of a random vector. Annals of Statistics, to appear, 2017.

S. Minsker. Geometric median and robust estimation in Banach spaces. Bernoulli, 21(4):2308-2335, 2015.

S. Minsker. Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Annals of Statistics, to appear, 2016.

S. Minsker and X. Wei. Estimation of the covariance structure of heavy-tailed distributions. In NIPS 2017, to appear, 2017.