Efficient and Robust Scale Estimation

Garth Tarr, Samuel Müller and Neville Weber
School of Mathematics and Statistics, The University of Sydney

Outline
- Introduction and motivation
- The robust scale estimator $P_n$
- Properties of $P_n$ in finite samples
- Summary and key references

History of relative efficiencies at the Gaussian ($n = 20$)

[Figure: relative efficiency plotted against year for the SD (1.00), IQR, MAD, $Q_n$, $S_n$ and $P_n$ (0.85).]

U-quantile statistics

Given data $X = (X_1, \ldots, X_n)$ and a symmetric kernel $h : \mathbb{R}^2 \to \mathbb{R}$, a U-statistic of order 2 is defined as

$$U_n(X) := \binom{n}{2}^{-1} \sum_{i<j} h(X_i, X_j). \tag{1}$$

Let $H(t) = P(h(X_i, X_j) \le t)$ be the cdf of the kernels, with corresponding empirical distribution function

$$H_n(t) := \binom{n}{2}^{-1} \sum_{i<j} I\{h(X_i, X_j) \le t\}, \quad t \in \mathbb{R}. \tag{2}$$

For $0 < p < 1$, the corresponding sample U-quantile is

$$H_n^{-1}(p) := \inf\{t : H_n(t) \ge p\}. \tag{3}$$
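The sample U-quantile in (3) can be sketched in a few lines of Python. This is only an illustration: the function name `u_quantile` and the toy data are assumptions, not part of the talk, and the brute-force enumeration of all pairs is O(n²).

```python
import math
from itertools import combinations


def u_quantile(data, kernel, p):
    """Sample U-quantile inf{t : H_n(t) >= p} for a symmetric kernel of order 2."""
    # Evaluate the kernel on every pair i < j and sort the results.
    values = sorted(kernel(a, b) for a, b in combinations(data, 2))
    # H_n jumps by 1/N at each kernel value, so the infimum is the
    # ceil(p * N)-th order statistic (tiny tolerance guards float round-off).
    k = max(1, math.ceil(p * len(values) - 1e-12))
    return values[k - 1]


# Hypothetical toy data, chosen only to make the arithmetic easy to follow.
data = [1, 2, 4, 7]
q25_abs_diff = u_quantile(data, lambda a, b: abs(a - b), 0.25)
q75_pair_mean = u_quantile(data, lambda a, b: (a + b) / 2, 0.75)
print(q25_abs_diff, q75_pair_mean)
```

Passing a different kernel is all that is needed to move between the absolute-difference and pairwise-mean cases.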

Generalized L-statistics

A generalized linear (GL) statistic can be defined as

$$T_n(H_n) = \int_I J(p)\, H_n^{-1}(p)\, dp + \sum_{j=1}^{d} a_j H_n^{-1}(p_j),$$

where
- $J$ is a function for smooth weighting of $H_n^{-1}(p)$,
- $I \subseteq [0, 1]$ is some interval,
- $a_j$ are discrete coefficients for $H_n^{-1}(p_j)$.

(Serfling, 1984)

Examples of GL-statistics
- Interquartile range: $h(x) = x$, $\mathrm{IQR} = H_n^{-1}(0.75) - H_n^{-1}(0.25)$
- Variance: $h(x, y) = \tfrac{1}{2}(x - y)^2$, $\int_0^1 H_n^{-1}(p)\, dp$
- Winsorized variance: $h(x, y) = \tfrac{1}{2}(x - y)^2$, $\int_0^{0.75} H_n^{-1}(p)\, dp + 0.25\, H_n^{-1}(0.75)$
- Rousseeuw and Croux's $Q_n$: $h(x, y) = |x - y|$, $H_n^{-1}(0.25)$
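As a quick check of the variance example: integrating the step function $H_n^{-1}(p)$ over $[0, 1]$ is the same as averaging the kernel values, and for $h(x, y) = \tfrac{1}{2}(x - y)^2$ that average is the usual unbiased sample variance. A minimal Python sketch, with made-up data:

```python
import math
from itertools import combinations
from statistics import fmean, variance

# Hypothetical data, used only to illustrate the identity.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]

# Kernel values h(x_i, x_j) = (x_i - x_j)^2 / 2 over all pairs i < j.
kernel_values = [0.5 * (a - b) ** 2 for a, b in combinations(data, 2)]

# Integral of H_n^{-1}(p) over [0, 1] = mean of the kernel values.
u_stat = fmean(kernel_values)

print(u_stat, variance(data))
```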

Pairwise mean scale estimator: $P_n$

Consider the set of $\binom{n}{2}$ pairwise means, $\{h(X_i, X_j),\ 1 \le i < j \le n\}$, where $h(X_1, X_2) = (X_1 + X_2)/2$. Let $H_n$ be the corresponding empirical distribution function:

$$H_n(t) := \binom{n}{2}^{-1} \sum_{i<j} I\{h(X_i, X_j) \le t\}, \quad t \in \mathbb{R}.$$

Definition. $P_n$ is defined as

$$P_n = c \left[ H_n^{-1}(0.75) - H_n^{-1}(0.25) \right],$$

where $c$ is a correction factor to make $P_n$ consistent for the standard deviation when the underlying observations are Gaussian.
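A brute-force sketch of $P_n$ in Python follows. The function name `pn` is an assumption, and the value of $c$ is derived here rather than quoted from the talk: for Gaussian data the pairwise mean $(X_1 + X_2)/2$ has variance $\sigma^2/2$, so the interquartile range of the pairwise means tends to $2\,\Phi^{-1}(0.75)\,\sigma/\sqrt{2}$, which gives $c = 1/(\sqrt{2}\,\Phi^{-1}(0.75)) \approx 1.048$.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist

# Gaussian consistency factor derived in the lead-in (an assumption of this sketch).
C_GAUSS = 1.0 / (math.sqrt(2.0) * NormalDist().inv_cdf(0.75))


def lower_quantile(sorted_vals, p):
    """inf{t : H_n(t) >= p} for an already sorted list of kernel values."""
    k = max(1, math.ceil(p * len(sorted_vals) - 1e-12))
    return sorted_vals[k - 1]


def pn(data, c=C_GAUSS):
    """P_n = c * (interquartile range of the n-choose-2 pairwise means)."""
    means = sorted((a + b) / 2 for a, b in combinations(data, 2))
    return c * (lower_quantile(means, 0.75) - lower_quantile(means, 0.25))


# Simulated N(0, 2^2) sample; P_n should land near the true scale of 2.
rng = random.Random(0)
sample = [rng.gauss(0.0, 2.0) for _ in range(300)]
estimate = pn(sample)
print(round(C_GAUSS, 4), estimate)
```

Enumerating all pairs costs O(n²) time and memory, which is fine for illustration but not for large samples.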

Influence curve

The influence curve for a functional $T$ at distribution $F$ is

$$IC(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{T((1 - \epsilon)F + \epsilon\delta_x) - T(F)}{\epsilon},$$

where $\delta_x$ has all its mass at $x$. Serfling (1984) outlines the IC for GL-statistics.

Influence curve for $P_n$. Assuming that $F$ has derivative $f > 0$ on $[F^{-1}(\epsilon), F^{-1}(1 - \epsilon)]$ for all $\epsilon > 0$,

$$IC(x; P_n, F) = c \left[ \frac{0.75 - F(2H_F^{-1}(0.75) - x)}{\int f(2H_F^{-1}(0.75) - y) f(y)\, dy} - \frac{0.25 - F(2H_F^{-1}(0.25) - x)}{\int f(2H_F^{-1}(0.25) - y) f(y)\, dy} \right].$$

Influence curves when $F = \Phi$

[Figure: $IC(x; T, F)$ plotted against $x$ for the SD, $P_n$, $Q_n$ and MAD.]

Asymptotic normality

The empirical U-process is $\left( \sqrt{n}\,(H_n(t) - H(t)) \right)_{t \in \mathbb{R}}$.

Silverman (1976) proved that, in this context, $\sqrt{n}\,(H_n(\cdot) - H(\cdot))$ converges weakly to an almost surely continuous zero-mean Gaussian process $W$ with covariance function

$$E\,W(s)W(t) = 4P(h(X_1, X_2) \le s,\ h(X_1, X_3) \le t) - 4H(s)H(t).$$

For $0 < p < q < 1$, if $H'$, the derivative of $H$, is strictly positive on the interval $[H^{-1}(p) - \varepsilon, H^{-1}(q) + \varepsilon]$ for some $\varepsilon > 0$, then we can use the inverse map to show

$$\sqrt{n} \left( H_n^{-1}(\cdot) - H^{-1}(\cdot) \right) \xrightarrow{D} \frac{W(H^{-1}(\cdot))}{H'(H^{-1}(\cdot))}.$$

Recall that $P_n = c \left[ H_n^{-1}(3/4) - H_n^{-1}(1/4) \right]$. Hence,

$$\sqrt{n}\,(P_n - \theta) \xrightarrow{D} N(0, V),$$

where $\theta = c \left( H^{-1}(3/4) - H^{-1}(1/4) \right)$ and $V$ both depend on the underlying distribution.

Asymptotic variance and relative efficiency

The asymptotic variance follows from the previous limit theorem or from the expected square of the influence function. When the underlying data are Gaussian, it is obtained by numerical integration of

$$V = \int IC(x; P_n, \Phi)^2\, d\Phi(x).$$

This equates to an asymptotic efficiency of 0.86, as compared with 0.82 for $Q_n$ and 0.37 for the MAD at the normal.

Result. We have a robust estimator that is asymptotically more efficient than $Q_n$ at the normal. But how does it compare at heavier tailed distributions?
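The Gaussian calculation can be reproduced with a short numerical sketch. It relies on two facts that are assumptions of this example rather than statements from the talk: at $F = \Phi$ the pairwise means are $N(0, 1/2)$, so $H^{-1}(0.75) = \Phi^{-1}(0.75)/\sqrt{2}$ and $H^{-1}(0.25) = -H^{-1}(0.75)$; and $\int \phi(2t - y)\phi(y)\,dy$ is the $N(0, 2)$ density at $2t$, that is $e^{-t^2}/\sqrt{4\pi}$. The SD is taken to have asymptotic variance $1/2$ at the standard normal.

```python
import math
from statistics import NormalDist

PHI = NormalDist()
Z75 = PHI.inv_cdf(0.75)
T75 = Z75 / math.sqrt(2.0)            # H^{-1}(0.75) for pairwise means of N(0, 1) data
C = 1.0 / (math.sqrt(2.0) * Z75)      # Gaussian consistency factor
# Integral of phi(2t - y) * phi(y) dy at t = +/- T75 (N(0, 2) density at 2t).
DENOM = math.exp(-T75 ** 2) / math.sqrt(4.0 * math.pi)


def ic_pn(x):
    """Influence curve of P_n at the standard normal."""
    upper = (0.75 - PHI.cdf(2.0 * T75 - x)) / DENOM
    lower = (0.25 - PHI.cdf(-2.0 * T75 - x)) / DENOM
    return C * (upper - lower)


def asymptotic_variance(lo=-10.0, hi=10.0, steps=4000):
    """Composite Simpson rule for the integral of IC^2 against the N(0, 1) density."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * ic_pn(x) ** 2 * PHI.pdf(x)
    return total * h / 3.0


V = asymptotic_variance()
efficiency = 0.5 / V  # relative to the SD, whose asymptotic variance is 1/2
print(V, efficiency)
```

The printed efficiency should land close to the 0.86 quoted above.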

Asymptotic relative efficiency when $f = t_\nu$ for $\nu \in [1, 10]$

[Figure: asymptotic relative efficiency plotted against degrees of freedom for $P_n$, $Q_n$ and the MAD.]

Discrete distributions

For $P_n = 0$, the interquartile range of the pairwise means must be equal to zero, i.e. more than 50% of the pairwise means must be equal.

For a discrete random variable with possible outcomes $x_1, x_2, \ldots$, and $P(X = x_j) = p_j$ for $j = 1, 2, \ldots$, the expected proportion of pairwise differences equal to zero is $\sum_j p_j^2$. In particular,

$$\sum_j p_j^2 > 0.25 \implies \lim_{n \to \infty} P(Q_n = 0) = 1.$$

Example ($X \sim \text{Poisson}(1)$). For the Poisson(1), $\sum_j p_j^2 \approx 0.31$ and so in the limit $Q_n = 0$ with high probability, whereas $c^{-1} P_n = 1$.
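The two Poisson(1) figures can be checked directly. The sketch below is illustrative only; it uses the fact that the sum of two independent Poisson(1) variables is Poisson(2), so each pairwise mean is half a Poisson(2) variable, and the helper names are assumptions.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_quantile(p, lam):
    """Smallest k whose Poisson(lam) cdf is at least p."""
    k, cum = 0, poisson_pmf(0, lam)
    while cum < p:
        k += 1
        cum += poisson_pmf(k, lam)
    return k


# Probability that two independent Poisson(1) draws are tied (terms beyond
# k = 50 are negligible).
tie_prob = sum(poisson_pmf(k, 1.0) ** 2 for k in range(50))

# Limiting quartiles of the pairwise means (X1 + X2) / 2, with X1 + X2 ~ Poisson(2).
q25 = poisson_quantile(0.25, 2.0) / 2
q75 = poisson_quantile(0.75, 2.0) / 2

print(round(tie_prob, 4), q75 - q25)
```

Since the tie probability exceeds 0.25, the 0.25 U-quantile of the absolute differences is zero in the limit, while the quartiles of the pairwise means stay one unit apart.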

In finite samples, apart from some trivial cases, $P_n = 0 \implies Q_n = 0$.

Example ($X \sim \text{Binomial}(6, 0.4)$). In the limit neither $Q_n$ nor $P_n$ will converge to zero. In samples of size $n = 20$, $Q_n$ will return a scale estimate of zero, on average, 12% of the time. In contrast, $P_n$ returns zero less than 0.1% of the time.
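A small simulation gives a feel for the Binomial(6, 0.4) example. This is a sketch under stated assumptions: $Q_n$ is taken as the $k$-th smallest absolute pairwise difference with $k = \binom{h}{2}$ and $h = \lfloor n/2 \rfloor + 1$, following Rousseeuw and Croux (1993); the seed, the number of replications and the function names are arbitrary choices; scaling constants are ignored because only exact zeros are counted.

```python
import math
import random
from itertools import combinations


def lower_quantile(sorted_vals, p):
    """inf{t : H_n(t) >= p} for an already sorted list of kernel values."""
    k = max(1, math.ceil(p * len(sorted_vals) - 1e-12))
    return sorted_vals[k - 1]


def pn_is_zero(x):
    # P_n = 0 exactly when the two quartiles of the pairwise means coincide.
    means = sorted((a + b) / 2 for a, b in combinations(x, 2))
    return lower_quantile(means, 0.75) == lower_quantile(means, 0.25)


def qn_is_zero(x):
    # Q_n = 0 exactly when the k-th smallest |x_i - x_j| is zero.
    h = len(x) // 2 + 1
    k = math.comb(h, 2)
    diffs = sorted(abs(a - b) for a, b in combinations(x, 2))
    return diffs[k - 1] == 0


def zero_rates(n=20, reps=1000, seed=1):
    """Proportion of simulated Binomial(6, 0.4) samples with Q_n = 0 and with P_n = 0."""
    rng = random.Random(seed)
    q_zero = p_zero = 0
    for _ in range(reps):
        x = [sum(rng.random() < 0.4 for _ in range(6)) for _ in range(n)]
        q_zero += qn_is_zero(x)
        p_zero += pn_is_zero(x)
    return q_zero / reps, p_zero / reps


q_rate, p_rate = zero_rates()
print(q_rate, p_rate)
```

With only 1000 replications the rates are noisy, so they should be read as rough indications rather than a reproduction of the figures above.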

Adaptive trimming: $\tilde{P}_n$

(Potential) cons with $P_n$:
- $P_n$ has a breakdown value of 13%.
- $P_n$ is not very efficient at the Cauchy.

For preliminary high breakdown location and scale estimates, $m(X)$ and $s(X)$ respectively, an observation $X_i$ is trimmed if

$$\frac{|X_i - m(X)|}{s(X)} > d, \tag{4}$$

where $d$ is the tuning parameter. This achieves the best possible breakdown value for a sensible choice of tuning parameter.

Definition. Denote $\tilde{P}_n$ as the adaptively trimmed $P_n$ with $d = 5$.
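The trimming rule (4) might be sketched as follows. The talk does not say which preliminary estimates are used, so the median for $m(X)$ and the MAD (scaled by 1.4826) for $s(X)$ are assumptions of this example, as are the function names and the reuse of the untrimmed Gaussian correction factor after trimming.

```python
import math
from itertools import combinations
from statistics import NormalDist, median

C_GAUSS = 1.0 / (math.sqrt(2.0) * NormalDist().inv_cdf(0.75))


def lower_quantile(sorted_vals, p):
    k = max(1, math.ceil(p * len(sorted_vals) - 1e-12))
    return sorted_vals[k - 1]


def pn(data, c=C_GAUSS):
    """c times the interquartile range of the pairwise means."""
    means = sorted((a + b) / 2 for a, b in combinations(data, 2))
    return c * (lower_quantile(means, 0.75) - lower_quantile(means, 0.25))


def trimmed_pn(data, d=5.0):
    """P_n after dropping observations with |x - m| / s > d (rule (4))."""
    m = median(data)                                    # assumed choice of m(X)
    s = 1.4826 * median([abs(x - m) for x in data])     # assumed choice of s(X)
    if s == 0:
        return pn(data)  # no usable preliminary scale, so nothing is trimmed
    kept = [x for x in data if abs(x - m) / s <= d]
    return pn(kept)


# Hypothetical sample with one gross outlier.
sample = [0.1, -0.3, 0.5, 1.2, -0.8, 0.0, 0.7, -1.1, 0.4, 1000.0]
print(pn(sample), trimmed_pn(sample))
```

In this toy sample only the value 1000 exceeds the cut-off, so the trimmed estimate equals $P_n$ computed on the remaining nine observations.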

Efficiency of $P_n$ in finite samples

Following Randal (2008), efficiencies are estimated over $m$ independent samples as

$$\widehat{\mathrm{eff}}(T) = \frac{\mathrm{Var}(\ln \hat{\sigma}_1, \ldots, \ln \hat{\sigma}_m)}{\mathrm{Var}(\ln T(X_1), \ldots, \ln T(X_m))}. \tag{5}$$

For each $i = 1, 2, \ldots, m$,
- $X_i$ are independent samples of size $n$,
- $\hat{\sigma}_i$ is the ML scale estimate, and
- $T(X_i)$ is the proposed scale estimate.

Distributions considered:
- $t$ distributions with degrees of freedom between 1 and 10.
- Configural polysampling using Tukey's 3 corners: Gaussian, One-wild and Slash.
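A minimal Monte Carlo sketch of (5) at the Gaussian, with $n = 20$, is given below. The seed, the number of replications $m = 2000$ and the function names are arbitrary assumptions; the ML scale estimate for Gaussian data is the standard deviation with divisor $n$.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist, variance

C_GAUSS = 1.0 / (math.sqrt(2.0) * NormalDist().inv_cdf(0.75))


def lower_quantile(sorted_vals, p):
    k = max(1, math.ceil(p * len(sorted_vals) - 1e-12))
    return sorted_vals[k - 1]


def pn(data, c=C_GAUSS):
    means = sorted((a + b) / 2 for a, b in combinations(data, 2))
    return c * (lower_quantile(means, 0.75) - lower_quantile(means, 0.25))


def ml_scale(data):
    """Gaussian maximum likelihood scale estimate (divisor n)."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))


def estimated_efficiency(n=20, m=2000, seed=0):
    """Ratio of the variances of the log scale estimates, as in (5)."""
    rng = random.Random(seed)
    log_ml, log_pn = [], []
    for _ in range(m):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        log_ml.append(math.log(ml_scale(x)))
        log_pn.append(math.log(pn(x)))
    return variance(log_ml) / variance(log_pn)


eff = estimated_efficiency()
print(eff)
```

Working on the log scale makes the ratio insensitive to the consistency factor $c$, since a multiplicative constant only shifts the logarithm.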

Relative efficiencies: $f = t_\nu$ for $\nu \in [1, 10]$ and $n = 20$

[Figure: estimated relative efficiency plotted against degrees of freedom for $P_n$, $\tilde{P}_n$ and $Q_n$.]

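Relative efficiencies at the Slash corner ($n = 20$)

[Figure: bar chart of estimated relative efficiency for $P_n$, $\tilde{P}_n$, 10% trimmed $P_n$, $Q_n$, $S_n$, MAD and IQR. Labelled values: 48%, 75%, 95%, 87%.]

Relative efficiencies at the One-wild corner ($n = 20$)

[Figure: bar chart of estimated relative efficiency for $P_n$, $\tilde{P}_n$, 10% trimmed $P_n$, $Q_n$, $S_n$, MAD and IQR. Labelled values: 84%, 72%, 68%, 40%.]

Relative efficiencies at the Gaussian corner ($n = 20$)

[Figure: bar chart of estimated relative efficiency for $P_n$, $\tilde{P}_n$, 10% trimmed $P_n$, $Q_n$, $S_n$, MAD and IQR. Labelled values: 81%, 85%, 67%, 38%.]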

Summary
1. Aim: an efficient, robust and widely applicable scale estimator.
2. Method: the $P_n$ scale estimator, a GL-statistic.
3. Results:
   - 86% asymptotic efficiency at the normal.
   - Highly efficient even when the underlying distribution has quite heavy tails.
   - Less likely to fail for discrete distributions than $Q_n$.

References
- Hampel, F. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346).
- Randal, J. (2008). A reinvestigation of robust scale estimation in finite samples. Computational Statistics & Data Analysis, 52(11).
- Rousseeuw, P. and Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424).
- Serfling, R. J. (1984). Generalized L-, M-, and R-statistics. The Annals of Statistics, 12(1).
- Silverman, B. (1976). Limit theorems for dissociated random variables. Advances in Applied Probability, 8(4).
