Sub-Gaussian estimators under heavy tails


1 Sub-Gaussian estimators under heavy tails Roberto Imbuzeiro Oliveira XIX Escola Brasileira de Probabilidade Maresias, August 6th 2015

2 Joint with Luc Devroye (McGill) Matthieu Lerasle (CNRS/Nice) Gábor Lugosi (ICREA/UPF)

3 Our problem (and why it's interesting)

4 Our problem We want to estimate the mean of a probability distribution over the real line from an i.i.d. sample. This is related to many fundamental statistical tasks.

5 Our problem We assume finite variances, but as little else as possible. Interesting in theory, important in practice.

6 Our problem Want nearly optimal tail bounds, uniformly over large classes of distributions. High-confidence estimates are sometimes necessary.

7 Formal statement Given: $\mathcal{P}$, a family of probability distributions over $\mathbb{R}$; it should be very large (nonparametric). For $P \in \mathcal{P}$, $\mu_P$ and $\sigma_P^2$ are the mean and variance of $P$. Want: for each large enough $n \in \mathbb{N}$, an estimator $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$ and a parameter $\delta_{\min} = \delta_{\min,n} \in [0,1)$, which should be very small (exponentially small in $n$?), such that, if $X_1^n = (X_1, \dots, X_n)$ is i.i.d. from $P \in \mathcal{P}$, then
$$\forall \delta \in [\delta_{\min}, 1): \quad \mathbb{P}\left( \left|\hat{E}_n(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta,$$
where $L$ is a constant that may depend on the family.

11 Why sub-Gaussian? What we ask for is basically that the estimator has Gaussian-like fluctuations around the mean:
$$\mathbb{P}\left( \left|\hat{E}_n(X_1^n) - \mu_P\right| > t\, \frac{\sigma_P}{\sqrt{n}} \right) \le C_1 e^{-C_2 t^2}.$$
Catoni: Gaussian-like fluctuations are optimal for "reasonable" families of distributions (more on this below).

12 Why is this interesting? Estimator must turn heavy tails into light tails! (Tail surgery?)


14 Why is this interesting? Related (weaker) estimators have been applied to problems in statistics and machine learning; our notion could improve these results. Audibert and Catoni + Hsu and Sabato (least squares), Bubeck et al. (bandits), Brownlees et al. (empirical risk minimization).

15 When is this possible? This is the main subject of our paper. We present our results before we move on.

16 Our results

17 First result Assumption: variance known up to an interval.

18 Partially known variance Example: $\mathcal{P}_{[\sigma_1^2, \sigma_2^2]}$ := all distributions with variance $\sigma_P^2 \in [\sigma_1^2, \sigma_2^2]$. We let $R := \sigma_2/\sigma_1$ (may depend on $n$). Theorem: If $R$ is bounded, then for all large enough $n$ there exist $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} = e^{-cn}$ and a constant $L$ such that, when $P \in \mathcal{P}_{[\sigma_1^2, \sigma_2^2]}$ and $X_1^n \sim P^n$,
$$\forall \delta \in [\delta_{\min}, 1): \quad \mathbb{P}\left( \left|\hat{E}_n(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \le \delta.$$
If $R$ is unbounded, any sequence $\delta_{\min} \to 0$ fails. This is optimal up to the exact values of $c > 0$ and $L > 0$, and the two regimes show truly different behavior!

21 Second result Assumption: (slightly) higher moments.

22 Higher moments Example: $\mathcal{P}_{\eta,\alpha}$ := all distributions with $\mathbb{E}_P |X - \mu_P|^\alpha \le (\eta\, \sigma_P)^\alpha$ (here $\alpha \in (2,3)$ is fixed; $\eta \ge 0$ may depend on $n$). Theorem: for all large enough $n$, if $k_{\eta,\alpha} := (C\eta)^{2\alpha/(\alpha-2)}$, there exist $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} = e^{-cn/k_{\eta,\alpha}}$, and a constant $L$ such that, when $P \in \mathcal{P}_{\eta,\alpha}$ and $X_1^n \sim P^n$,
$$\forall \delta \in [\delta_{\min}, 1): \quad \mathbb{P}\left( \left|\hat{E}_n(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \le \delta.$$
This is optimal up to the value of $c > 0$.

24 An extension It suffices to assume that the family consists of distributions that are $k$-regular: $\exists k \in \mathbb{N}$, $\forall P \in \mathcal{P}$, $\forall j \ge k$: if $X_1^j \sim P^j$, then
$$\mathbb{P}\left( \pm \sum_{i=1}^j (X_i - \mu_P) \le 0 \right) \ge \frac{1}{3} \quad \text{(for both choices of sign)}.$$
For instance, symmetric distributions are 1-regular.

25 Third result Assumption: bounded kurtosis. Under this assumption one can get the nearly optimal constant $L = \sqrt{2} + \varepsilon$. This will be discussed further later.

26 Some background

27 History Typical analyses of estimators for means are based on expectations, not deviations. Exceptions do exist (e.g. Kolmogorov's CLT for medians), but the assumptions and goals are different.

28 History Catoni's paper (AIHP Prob. Stat. 2012) seems to be the first to focus on deviations as a fundamental problem. We'll mention some more applied results later.

29 Gaussian lower bound Recall the normal cumulative distribution function
$$\Phi(r) := \int_{-\infty}^r e^{-x^2/2}\, \frac{dx}{\sqrt{2\pi}}, \qquad \Phi^{-1}(1-\delta) \sim \sqrt{2 \ln(1/\delta)} \ \text{ for } \delta \ll 1.$$
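The quantile asymptotic above can be checked directly; here is a small numeric sketch (ours, not from the slides), using only the Python standard library:

```python
# Check Phi^{-1}(1 - delta) ~ sqrt(2 ln(1/delta)) as delta -> 0.
import math
from statistics import NormalDist

std_normal = NormalDist()        # Phi is .cdf, Phi^{-1} is .inv_cdf

ratios = []
for delta in (1e-2, 1e-6, 1e-12):
    exact = std_normal.inv_cdf(1 - delta)         # Phi^{-1}(1 - delta)
    approx = math.sqrt(2 * math.log(1 / delta))   # sqrt(2 ln(1/delta))
    ratios.append(exact / approx)
    print(f"delta={delta:g}  Phi^-1={exact:.3f}  approx={approx:.3f}")
```

The ratio climbs toward 1 from below (roughly 0.77, 0.90, 0.95 here), reflecting the lower-order correction hidden in the $\sim$.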

30 Gaussian lower bound Family: $\mathcal{P}^{\mathrm{Gauss}}_{\sigma^2}$, all Gaussian distributions over $\mathbb{R}$ with variance $\sigma^2 > 0$. Thm (Catoni): for any $n$ and $\delta$,
$$\inf_{\hat{E}_n}\ \sup_{\substack{P \in \mathcal{P}^{\mathrm{Gauss}}_{\sigma^2} \\ X_1^n \sim P^n}} \mathbb{P}\left( \hat{E}_n(X_1^n) - \mu_P \ge \Phi^{-1}(1-\delta)\, \frac{\sigma_P}{\sqrt{n}} \right) = \delta.$$
A similar result holds for the lower tail. This deviation is asymptotic to $L \sqrt{\ln(1/\delta)}\, \sigma_P/\sqrt{n}$ with $L = \sqrt{2}$: compare with the definition of a sub-Gaussian estimator above, which asks for $L \sigma_P \sqrt{(1+\ln(1/\delta))/n}$.

34 The empirical mean $\hat{E}_n(X_1^n) := \frac{1}{n} \sum_{i=1}^n X_i$. It follows from Catoni's result that the empirical mean has optimal deviations for all Gaussian distributions. This is the exception rather than the rule.

35 Empirical mean fails Example: $\mathcal{P}_{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$. Thm (Catoni): Chebyshev is basically optimal, i.e.
$$\sup_{\substack{P \in \mathcal{P}_{\sigma^2} \\ X_1^n \sim P^n}} \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n X_i - \mu_P > c\, \frac{\sigma_P}{\sqrt{\delta n}} \right) \ge \delta.$$

36 Empirical mean fails Example: $\mathcal{P}_{\mathrm{krt} \le \kappa}$, all distributions with kurtosis $\kappa_P := \mathbb{E}_P |X - \mu_P|^4 / \sigma_P^4 \le \kappa$. Thm (Catoni): If $n$ is large and $\delta \le 1/n$,
$$\sup_{\substack{P \in \mathcal{P}_{\mathrm{krt} \le \kappa} \\ X_1^n \sim P^n}} \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n X_i - \mu_P > c\, \frac{\sigma_P}{(\delta n)^{1/4}} \right) \ge \delta.$$

37 Positive results Catoni obtained sharp sub-gaussian estimators in some settings. Unfortunately, they depend on the confidence level!

38 One example Example: $\mathcal{P}_{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$. Thm (Catoni): Set $\delta_{\min} := e^{-o(n)}$. Then $\forall \delta \in [\delta_{\min}, 1)$, there exists a $\delta$-dependent $\hat{E}_{n,\delta}$ with
$$\sup_{\substack{P \in \mathcal{P}_{\sigma^2} \\ X_1^n \sim P^n}} \mathbb{P}\left( \left|\hat{E}_{n,\delta}(X_1^n) - \mu_P\right| > \sigma_P \sqrt{\frac{(2+o(1)) \ln(2/\delta)}{n}} \right) \le \delta.$$

39 Why is this bad? Suppose you want high confidence. The only guarantee is that the probability of a huge error is very low. Nothing is known about the probability of average-to-large errors in more typical events.

40 Why is this bad? Statistical and machine learning applications (Bubeck et al., Brownlees et al., Hsu/Sabato) had to cope with this dependence on the confidence level. In all cases, something was lost.

41 Our results are better, or rather genuinely different. Our results imply that parameter-dependent estimators are easier to obtain. We'll see that right now.

42 Median of means

43 Median of means A simple construction of a sub-Gaussian parameter-dependent estimator that only requires finite second moments. Known for a long time, in many forms, in different communities (Nemirovski/Yudin, Alon/Matias/Szegedy, Levin, Jerrum/Sinclair, Hsu). Pre-history.

44 Median of means Example: $\mathcal{P}_{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$. Thm: Set $\delta_{\min} := e^{1-n/2}$. Then $\forall \delta \in [\delta_{\min}, 1)$, there exists a $\delta$-dependent $\hat{E}_{n,\delta}$ with
$$\sup_{\substack{P \in \mathcal{P}_{\sigma^2} \\ X_1^n \sim P^n}} \mathbb{P}\left( \left|\hat{E}_{n,\delta}(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1+\ln(2/\delta)}{n}} \right) \le \delta.$$

45 Median of means Sample: $X_1^n := (X_1, X_2, \dots, X_n)$ from distribution $P$. Blocks: split $\{1, 2, \dots, n\} = B_1 \cup B_2 \cup \dots \cup B_b$ into disjoint blocks of size $\approx n/b$. Means: for each block $B_\ell$, define $Y_\ell := \frac{b}{n} \sum_{i \in B_\ell} X_i$. Median of means: $\hat{E}_{n,\delta}(X_1^n) :=$ median of $(Y_1, Y_2, \dots, Y_b)$.
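The construction on this slide is short in code; here is a minimal sketch (function and variable names are ours, not from the paper), using only the Python standard library, with a heavy-tailed Pareto demo:

```python
# Minimal median-of-means sketch (names ours).  Blocks are contiguous
# slices; a leftover remainder of fewer than block_size points is dropped.
import random
import statistics

def median_of_means(sample, n_blocks):
    """Split the sample into n_blocks disjoint blocks, average each block,
    and return the median of the block means."""
    block_size = len(sample) // n_blocks
    block_means = [
        statistics.fmean(sample[k * block_size:(k + 1) * block_size])
        for k in range(n_blocks)
    ]
    return statistics.median(block_means)

# Demo: Pareto with tail index 2.5 has finite variance but infinite third
# moment, so it is a natural heavy-tailed test case.
rng = random.Random(0)
sample = [rng.paretovariate(2.5) for _ in range(10_000)]
true_mean = 2.5 / 1.5            # alpha / (alpha - 1) for Pareto(alpha)
est = median_of_means(sample, n_blocks=25)   # think b ~ ln(1/delta)
print(f"median of means: {est:.3f}  (true mean {true_mean:.3f})")
```

With $b$ blocks the estimator targets confidence level $\delta \approx e^{-b}$, matching the analysis on the following slides.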

46 Analysis Consider the interval $\left[\mu_P - L\sigma_P\sqrt{b/n},\ \mu_P + L\sigma_P\sqrt{b/n}\right] \subset \mathbb{R}$ around $\mu_P$.

47 Analysis Want: the median of $Y_1, \dots, Y_b$ in the interval. Sufficient: more than half of the $Y_\ell$'s are in there.

48 Analysis $Y_\ell = \frac{b}{n} \sum_{i \in B_\ell} X_i$, with the $X_i$ i.i.d. $\sim P$, so $\mathbb{E}(Y_\ell) = \mu_P$ and $\mathrm{Var}(Y_\ell) = b\, \sigma_P^2 / n$.

49 Analysis By Chebyshev, $\mathbb{P}(Y_\ell \notin \text{interval}) \le 1/L^2$. Disjoint blocks $\Rightarrow$ these events are independent.

50 Analysis The probability that at least $b/2$ of the $Y_\ell$'s fall outside the interval is bounded by a binomial tail probability. If $L$ is large enough,
$$\mathbb{P}\left( \mathrm{Bin}(b, 1/L^2) \ge b/2 \right) \le e^{-b}.$$
Take $b \approx \ln(1/\delta)$ and we're done.
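The binomial step can be checked exactly. A small numeric sketch (ours; the slide only says "$L$ large", and $L = 6$ is our choice that makes the inequality hold):

```python
# Exact check that P(Bin(b, 1/L^2) >= b/2) <= e^{-b} once L is large
# enough; L = 6 is our arbitrary-but-sufficient choice.
import math

def binom_upper_tail(b, p, threshold):
    """Exact P(Bin(b, p) >= threshold)."""
    return sum(
        math.comb(b, k) * p**k * (1 - p)**(b - k)
        for k in range(threshold, b + 1)
    )

L = 6
p = 1 / L**2
for b in (1, 5, 10, 20, 40):
    tail = binom_upper_tail(b, p, math.ceil(b / 2))
    assert tail <= math.exp(-b)
    print(f"b={b:2d}  P(Bin >= b/2) = {tail:.3e} <= e^-b = {math.exp(-b):.3e}")
```

A Chernoff calculation suggests $L$ somewhat above $2e$ suffices for the $e^{-b}$ rate, which is why a mid-size constant like 6 works here.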

52 Our proof ideas

53 Exponential is optimal Family: $\mathcal{P}^{\mathrm{La}}$, all Laplace distributions $\mathrm{La}_\mu$ with $\mu \in \mathbb{R}$ and density $\frac{d\mathrm{La}_\mu}{dx}(x) = \frac{e^{-|x-\mu|}}{2}$. Property: $e^{-n|\mu|} \le \frac{d\mathrm{La}_\mu^{\otimes n}}{d\mathrm{La}_0^{\otimes n}}(x) \le e^{n|\mu|}$. Consequence: any estimator with constant $L$ will mistake a $\mathrm{La}_0$ sample for a $\mathrm{La}_{10L^2}$ sample with probability at least $e^{-1-10L^2 n}$, so no sequence $\delta_{\min}$ below $e^{-cn}$ can work.

54 Partially known variance (recall) Example: $\mathcal{P}_{[\sigma_1^2, \sigma_2^2]}$ := all distributions with variance $\sigma_P^2 \in [\sigma_1^2, \sigma_2^2]$, with $R := \sigma_2/\sigma_1$ (may depend on $n$). Recall the theorem: if $R$ is bounded, a sub-Gaussian estimator exists with $\delta_{\min} = e^{-cn}$; if $R$ is unbounded, any sequence $\delta_{\min} \to 0$ fails.

55 Why unbounded fails Family: $\mathcal{P}^{\mathrm{Po}}_{[c/n,\, Rc/n]}$, Poisson random variables with very small means $c/n \le \mu_P \le Rc/n$. Recall that mean = variance for the Poisson! $X_1^n$ := sample with mean $c/n$, $S_X := X_1 + \dots + X_n$. $Y_1^n$ := sample with mean $Rc/n$, $S_Y := Y_1 + \dots + Y_n$.

56 Why unbounded fails Assume a good estimator $\hat{E}_n$ with constant $L$. Then
$$\mathbb{P}\left( n \hat{E}_n(Y_1^n) \ge Rc/2 \right) \ge 1 - e^{1 - Rc/(4L^2)}.$$
In particular, $\mathbb{P}\left( n \hat{E}_n(Y_1^n) \ge Rc/2 \,\middle|\, S_Y = Rc \right) \approx 1$. The same holds for $X$ as for $Y$: the sample sum is a sufficient statistic, so the conditional law of the sample given its sum is the same in both cases.

58 Why unbounded fails So $\mathbb{P}\left( n \hat{E}_n(X_1^n) \ge Rc/2 \,\middle|\, S_X = Rc \right) \approx 1$, hence
$$\mathbb{P}\left( n \hat{E}_n(X_1^n) \ge Rc/2 \right) \gtrsim \mathbb{P}(S_X = Rc) \ge e^{-Rc \ln R}.$$
On the other hand, this probability should be at most $e^{1 - R^2 c/(4L^2)}$ by the sub-Gaussian estimation property. Contradiction for $R$ large.
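The clash between the two exponents can be seen numerically; in this sketch (ours) the constants $c = 1$ and $L = 2$ are arbitrary demo choices:

```python
# Compare ln P(S_X = Rc) for S_X ~ Poisson(c) against the log of the
# bound a sub-Gaussian estimator would force, -R^2 c / (4 L^2).  For
# large R the exact Poisson probability is far bigger than the
# sub-Gaussian bound allows: contradiction.  c = 1, L = 2 are ours.
import math

c, L = 1, 2
for R in (8, 16, 32, 64):
    k = R * c                                    # target value of the sum
    log_poisson = -c + k * math.log(c) - math.lgamma(k + 1)  # ln P(Pois(c)=k)
    log_bound = -R**2 * c / (4 * L**2)           # ln e^{-R^2 c / (4 L^2)}
    print(f"R={R:3d}  ln P(S_X=Rc) = {log_poisson:8.1f}   ln bound = {log_bound:8.1f}")
```

By $R = 64$ the exact Poisson log-probability (about $-206$) already exceeds the sub-Gaussian log-bound ($-256$), which is the contradiction the slide invokes; the gap only widens as $R$ grows.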

59 The positive result Recall: if $R = \sigma_2/\sigma_1$ is bounded, then for all large enough $n$ there exist $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} = e^{-cn}$ and a constant $L$ with the sub-Gaussian property over $\mathcal{P}_{[\sigma_1^2, \sigma_2^2]}$. Here is how to prove it.

60 Confidence intervals Use median of means to get a confidence interval:
$$\hat{I}_{n,\delta}(X_1^n) := \left[ \hat{E}_{n,\delta}(X_1^n) \pm L \sigma_2 \sqrt{\frac{1+\ln(1/\delta)}{n}} \right],$$
which satisfies
$$\mathbb{P}\left( \mu_P \in \hat{I}_{n,\delta}(X_1^n) \text{ and } \left|\hat{I}_{n,\delta}(X_1^n)\right| \le 2LR\, \sigma_P \sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \ge 1 - \delta.$$

61 Confidence intervals We'll combine sub-Gaussian confidence intervals to obtain a single sub-Gaussian estimator. Similar in spirit to Lepskii's adaptation method from nonparametric statistics.

62 Confidence intervals Lemma: let $I_1, I_2, \dots, I_K$ be random nonempty closed intervals. Assume $\mu \in \mathbb{R}$ satisfies $\mathbb{P}(\mu \notin I_k) \le 2^{-k}$ for $1 \le k \le K$. Set $\hat{K} := \min\{k \le K : \bigcap_{j=k}^K I_j \ne \emptyset\}$ and let $\hat{E}$ := midpoint of $\bigcap_{j=\hat{K}}^K I_j$. Then for all $1 \le k \le K$: $\mathbb{P}\left( |\hat{E} - \mu| > |I_k| \right) \le 2^{1-k}$, where $|I_k|$ denotes the length of $I_k$.
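The lemma's combination rule is short in code; a sketch (function name ours):

```python
# Combination rule from the lemma: find the smallest k whose suffix
# I_k, ..., I_K has nonempty intersection, and return the midpoint of
# that intersection.
def combine_intervals(intervals):
    """intervals: list of (lo, hi) pairs I_1, ..., I_K (lo <= hi).
    Returns the midpoint of the intersection of I_khat, ..., I_K, where
    khat is minimal such that this suffix intersection is nonempty."""
    K = len(intervals)
    for k in range(K):                       # k is khat - 1 (0-based)
        lo = max(l for l, _ in intervals[k:])
        hi = min(h for _, h in intervals[k:])
        if lo <= hi:                         # suffix intersection nonempty
            return (lo + hi) / 2
    raise ValueError("unreachable for nonempty inputs: I_K alone is a valid suffix")

# I_1 misses; I_2 and I_3 overlap, so khat = 2 and the answer is the
# midpoint of [10, 11].
print(combine_intervals([(0, 1), (10, 12), (9, 11)]))  # prints 10.5
```

Note the loop always terminates via the `return`: the final suffix is $I_K$ alone, which is nonempty by assumption.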

63 Proof sketch $I_1, \dots, I_K$ random nonempty closed intervals; $\hat{K} := \min\{k \le K : \bigcap_{j=k}^K I_j \ne \emptyset\}$; $\hat{E}$ := midpoint of $\bigcap_{j=\hat{K}}^K I_j$. Assume $\mu \in I_j$ for all $j \ge k$. Then $\bigcap_{j=k}^K I_j \ne \emptyset$, so $\hat{K} \le k$ and $\bigcap_{j=\hat{K}}^K I_j \subseteq I_k$. Hence $\hat{E}, \mu \in I_k$ under the assumption, and therefore $\mathbb{P}\left( |\hat{E} - \mu| > |I_k| \right) \le \sum_{j \ge k} \mathbb{P}(\mu \notin I_j)$.

64 Other uses Example: $\mathcal{P}_{\eta,\alpha}$ := all distributions with $\mathbb{E}_P |X - \mu_P|^\alpha \le (\eta\, \sigma_P)^\alpha$ (here $\alpha \in (2,3)$ is fixed; $\eta \ge 0$ may depend on $n$). To prove the higher-moments theorem (with $\delta_{\min} = e^{-cn/k_{\eta,\alpha}}$), use quantiles of means, instead of medians of means, to build the confidence intervals. Berry-Esseen-type bounds show that the empirical means of the blocks are nearly symmetric.

66 Different ideas - kurtosis Under bounded kurtosis, one can use the empirical mean of truncated random variables. The truncation is data-driven and uses preliminary estimates of the mean and variance. Use empirical-process theory to show this is similar to truncating at the exact mean and variance. Sharp bounds!

67 Open problems

68 Open problems Sharp constants are essential for statisticians. Are sub-gaussian confidence intervals somehow equivalent to sub-gaussian estimators? Efficient extensions to vector-valued data and to risk minimization problems. Optimal deviation bounds for Poissons, Bernoullis, etc.

69 Obrigado! (Thank you!) References in the next slides.

70 Our preprint Should be posted to the arXiv in a few weeks. Available upon request from roboliv AT gmail.com.

71 Catoni's work Catoni's estimation paper and its companion paper on least squares (with Audibert): J.-Y. Audibert & O. Catoni. "Robust linear least squares regression." Ann. Stat. 39 no. 5 (2011). O. Catoni. "Challenging the empirical mean and empirical variance: A deviation study." Ann. Inst. H. Poincaré Probab. Statist. 48 no. 4 (2012).

72 Median of means D. Hsu, robust-statistics.html (see also Levin, L. "Notes for Miscellaneous Lectures." arXiv:cs/). N. Alon, Y. Matias & M. Szegedy. "The Space Complexity of Approximating the Frequency Moments." J. Comput. Syst. Sci. 58 no. 1 (1999). A. Nemirovski & D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley (1983).

73 Some applications C. Brownlees, E. Joly & G. Lugosi. "Empirical risk minimization for heavy-tailed losses." To appear in Ann. Stat. S. Bubeck, N. Cesa-Bianchi & G. Lugosi. "Bandits with heavy tail." IEEE Transactions on Information Theory 59 no. 11 (2013). D. Hsu & S. Sabato. "Loss minimization and parameter estimation with heavy tails." arXiv: . Abstract in ICML proceedings (2014).


More information

1 Degree distributions and data

1 Degree distributions and data 1 Degree distributions and data A great deal of effort is often spent trying to identify what functional form best describes the degree distribution of a network, particularly the upper tail of that distribution.

More information

Sharpness of second moment criteria for branching and tree-indexed processes

Sharpness of second moment criteria for branching and tree-indexed processes Sharpness of second moment criteria for branching and tree-indexed processes Robin Pemantle 1, 2 ABSTRACT: A class of branching processes in varying environments is exhibited which become extinct almost

More information

Week 1 Quantitative Analysis of Financial Markets Distributions A

Week 1 Quantitative Analysis of Financial Markets Distributions A Week 1 Quantitative Analysis of Financial Markets Distributions A Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 October

More information

15-388/688 - Practical Data Science: Basic probability. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Basic probability. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Basic probability J. Zico Kolter Carnegie Mellon University Spring 2018 1 Announcements Logistics of next few lectures Final project released, proposals/groups due

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

Fluctuations from the Semicircle Law Lecture 4

Fluctuations from the Semicircle Law Lecture 4 Fluctuations from the Semicircle Law Lecture 4 Ioana Dumitriu University of Washington Women and Math, IAS 2014 May 23, 2014 Ioana Dumitriu (UW) Fluctuations from the Semicircle Law Lecture 4 May 23, 2014

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

STAT 430/510: Lecture 15

STAT 430/510: Lecture 15 STAT 430/510: Lecture 15 James Piette June 23, 2010 Updates HW4 is up on my website. It is due next Mon. (June 28th). Starting today back at section 6.4... Conditional Distribution: Discrete Def: The conditional

More information

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population Lecture 5 1 Lecture 3 The Population Variance The population variance, denoted σ 2, is the sum of the squared deviations about the population mean divided by the number of observations in the population,

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

arxiv: v1 [cs.lg] 16 Jan 2017

arxiv: v1 [cs.lg] 16 Jan 2017 Achieving Privacy in the Adversarial Multi-Armed Bandit Aristide C. Y. Tossou Chalmers University of Technology Gothenburg, Sweden aristide@chalmers.se Christos Dimitrakakis University of Lille, France

More information

Concentration inequalities and tail bounds

Concentration inequalities and tail bounds Concentration inequalities and tail bounds John Duchi Outline I Basics and motivation 1 Law of large numbers 2 Markov inequality 3 Cherno bounds II Sub-Gaussian random variables 1 Definitions 2 Examples

More information

Fundamental Tools - Probability Theory IV

Fundamental Tools - Probability Theory IV Fundamental Tools - Probability Theory IV MSc Financial Mathematics The University of Warwick October 1, 2015 MSc Financial Mathematics Fundamental Tools - Probability Theory IV 1 / 14 Model-independent

More information

COMPLETE QTH MOMENT CONVERGENCE OF WEIGHTED SUMS FOR ARRAYS OF ROW-WISE EXTENDED NEGATIVELY DEPENDENT RANDOM VARIABLES

COMPLETE QTH MOMENT CONVERGENCE OF WEIGHTED SUMS FOR ARRAYS OF ROW-WISE EXTENDED NEGATIVELY DEPENDENT RANDOM VARIABLES Hacettepe Journal of Mathematics and Statistics Volume 43 2 204, 245 87 COMPLETE QTH MOMENT CONVERGENCE OF WEIGHTED SUMS FOR ARRAYS OF ROW-WISE EXTENDED NEGATIVELY DEPENDENT RANDOM VARIABLES M. L. Guo

More information

1 Exercises for lecture 1

1 Exercises for lecture 1 1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

On the singular values of random matrices

On the singular values of random matrices On the singular values of random matrices Shahar Mendelson Grigoris Paouris Abstract We present an approach that allows one to bound the largest and smallest singular values of an N n random matrix with

More information

LARGE DEVIATION PROBABILITIES FOR SUMS OF HEAVY-TAILED DEPENDENT RANDOM VECTORS*

LARGE DEVIATION PROBABILITIES FOR SUMS OF HEAVY-TAILED DEPENDENT RANDOM VECTORS* LARGE EVIATION PROBABILITIES FOR SUMS OF HEAVY-TAILE EPENENT RANOM VECTORS* Adam Jakubowski Alexander V. Nagaev Alexander Zaigraev Nicholas Copernicus University Faculty of Mathematics and Computer Science

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Quantile Regression for Extraordinarily Large Data

Quantile Regression for Extraordinarily Large Data Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile regression Two-step

More information

Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n

Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Jiantao Jiao (Stanford EE) Joint work with: Kartik Venkat Yanjun Han Tsachy Weissman Stanford EE Tsinghua

More information

Bahadur representations for bootstrap quantiles 1

Bahadur representations for bootstrap quantiles 1 Bahadur representations for bootstrap quantiles 1 Yijun Zuo Department of Statistics and Probability, Michigan State University East Lansing, MI 48824, USA zuo@msu.edu 1 Research partially supported by

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

arxiv: v2 [math.st] 30 Nov 2017

arxiv: v2 [math.st] 30 Nov 2017 Robust machine learning by median-of-means : theory and practice G. Lecué and M. Lerasle December 4, 2017 arxiv:1711.10306v2 [math.st] 30 Nov 2017 Abstract We introduce new estimators for robust machine

More information

TUM 2016 Class 1 Statistical learning theory

TUM 2016 Class 1 Statistical learning theory TUM 2016 Class 1 Statistical learning theory Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Machine learning applications Texts Images Data: (x 1, y 1 ),..., (x n, y n ) Note: x i s huge dimensional! All

More information

Limiting Distributions

Limiting Distributions Limiting Distributions We introduce the mode of convergence for a sequence of random variables, and discuss the convergence in probability and in distribution. The concept of convergence leads us to the

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Basic concepts of probability theory

Basic concepts of probability theory Basic concepts of probability theory Random variable discrete/continuous random variable Transform Z transform, Laplace transform Distribution Geometric, mixed-geometric, Binomial, Poisson, exponential,

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

On rate of convergence in distribution of asymptotically normal statistics based on samples of random size

On rate of convergence in distribution of asymptotically normal statistics based on samples of random size Annales Mathematicae et Informaticae 39 212 pp. 17 28 Proceedings of the Conference on Stochastic Models and their Applications Faculty of Informatics, University of Debrecen, Debrecen, Hungary, August

More information

Asymptotic distribution of the sample average value-at-risk in the case of heavy-tailed returns

Asymptotic distribution of the sample average value-at-risk in the case of heavy-tailed returns Asymptotic distribution of the sample average value-at-risk in the case of heavy-tailed returns Stoyan V. Stoyanov Chief Financial Researcher, FinAnalytica Inc., Seattle, USA e-mail: stoyan.stoyanov@finanalytica.com

More information

Mathematics Qualifying Examination January 2015 STAT Mathematical Statistics

Mathematics Qualifying Examination January 2015 STAT Mathematical Statistics Mathematics Qualifying Examination January 2015 STAT 52800 - Mathematical Statistics NOTE: Answer all questions completely and justify your derivations and steps. A calculator and statistical tables (normal,

More information

Part 2: One-parameter models

Part 2: One-parameter models Part 2: One-parameter models 1 Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes

More information

A Note on Interference in Random Networks

A Note on Interference in Random Networks CCCG 2012, Charlottetown, P.E.I., August 8 10, 2012 A Note on Interference in Random Networks Luc Devroye Pat Morin Abstract The (maximum receiver-centric) interference of a geometric graph (von Rickenbach

More information

Model Fitting. Jean Yves Le Boudec

Model Fitting. Jean Yves Le Boudec Model Fitting Jean Yves Le Boudec 0 Contents 1. What is model fitting? 2. Linear Regression 3. Linear regression with norm minimization 4. Choosing a distribution 5. Heavy Tail 1 Virus Infection Data We

More information

arxiv: v2 [math.pr] 8 Feb 2016

arxiv: v2 [math.pr] 8 Feb 2016 Noname manuscript No will be inserted by the editor Bounds on Tail Probabilities in Exponential families Peter Harremoës arxiv:600579v [mathpr] 8 Feb 06 Received: date / Accepted: date Abstract In this

More information

Limiting Distributions

Limiting Distributions We introduce the mode of convergence for a sequence of random variables, and discuss the convergence in probability and in distribution. The concept of convergence leads us to the two fundamental results

More information

The information complexity of sequential resource allocation

The information complexity of sequential resource allocation The information complexity of sequential resource allocation Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishan SMILE Seminar, ENS, June 8th, 205 Sequential allocation

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

R. Lachieze-Rey Recent Berry-Esseen bounds obtained with Stein s method andgeorgia PoincareTech. inequalities, 1 / with 29 G.

R. Lachieze-Rey Recent Berry-Esseen bounds obtained with Stein s method andgeorgia PoincareTech. inequalities, 1 / with 29 G. Recent Berry-Esseen bounds obtained with Stein s method and Poincare inequalities, with Geometric applications Raphaël Lachièze-Rey, Univ. Paris 5 René Descartes, Georgia Tech. R. Lachieze-Rey Recent Berry-Esseen

More information

Homework 4 Solutions

Homework 4 Solutions CS 174: Combinatorics and Discrete Probability Fall 01 Homework 4 Solutions Problem 1. (Exercise 3.4 from MU 5 points) Recall the randomized algorithm discussed in class for finding the median of a set

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Ann Inst Stat Math (2009) 61:773 787 DOI 10.1007/s10463-008-0172-6 Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Taisuke Otsu Received: 1 June 2007 / Revised:

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Igor Prünster University of Turin, Collegio Carlo Alberto and ICER Joint work with P. Di Biasi and G. Peccati Workshop on Limit Theorems and Applications Paris, 16th January

More information

MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES

MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES J. Korean Math. Soc. 47 1, No., pp. 63 75 DOI 1.4134/JKMS.1.47..63 MOMENT CONVERGENCE RATES OF LIL FOR NEGATIVELY ASSOCIATED SEQUENCES Ke-Ang Fu Li-Hua Hu Abstract. Let X n ; n 1 be a strictly stationary

More information