Distributed Statistical Estimation and Rates of Convergence in Normal Approximation
1 Distributed Statistical Estimation and Rates of Convergence in Normal Approximation. Stas Minsker (joint with Nate Strawn), Department of Mathematics, USC. July 3, 2017. Colloquium on Concentration Inequalities, High-dimensional Statistics, and Stein's Method.
2-6 Challenges of Contemporary Statistics
Resource limitations: massive data need computer clusters for storage and efficient processing $\Rightarrow$ requires algorithms that can be implemented in parallel.
Presence of outliers of unknown nature $\Rightarrow$ requires algorithms that are robust and do not rely on preprocessing or outlier detection.
While ad-hoc techniques exist for some problems, we would like to develop general methods.
[Diagram: worker nodes Node 1, Node 2, ..., Node k connected to a Master node.]
7-10 Parallel algorithms
The data are split into Subset 1, ..., Subset k, one per node. Two regimes:
"Embarrassingly parallel" (no communication between nodes): de-bias and take the average (very general, but not robust; de-biasing requires estimators to be asymptotically normal), or compute the spatial median?
Communication allowed: versions of gradient descent.
11-14 Example: how to estimate the mean?
Assume that $X_1,\dots,X_N$ are i.i.d. $\mathcal N(\mu,\sigma^2)$.
Problem: construct $\mathrm{CI}_{\mathrm{norm}}(\alpha)$ for $\mu$ with coverage probability $1-2\alpha$.
Solution: compute $\hat\mu := \frac{1}{N}\sum_{j=1}^N X_j$ and take
$$\mathrm{CI}_{\mathrm{norm}}(\alpha) = \left[\hat\mu - \sigma\sqrt{\frac{2\log(1/\alpha)}{N}},\ \hat\mu + \sigma\sqrt{\frac{2\log(1/\alpha)}{N}}\right].$$
To find $\hat\mu$ in a distributed fashion: set $m = N/k$, compute the block means
$$\bar\mu_1 := m^{-1}\sum_{j=1}^{m} X_j,\ \dots,\ \bar\mu_k := m^{-1}\sum_{j=N-m+1}^{N} X_j,$$
and average them: $\hat\mu = k^{-1}\sum_{j=1}^{k}\bar\mu_j$.
Averaging works in many scenarios where estimators have small bias (J. Fan, H. Liu et al.; J. Lee, J. Taylor et al.; Y. Zhang, J. Duchi, M. Wainwright).
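The splitting-and-averaging scheme above can be sketched in a few lines (a minimal illustration, not the speakers' code; the block layout and sample size are my choices):

```python
import numpy as np

def distributed_mean(x, k):
    """Split the sample into k blocks, average each block on its own
    node, then average the k block means on the master node."""
    blocks = np.array_split(x, k)             # Subset 1, ..., Subset k
    block_means = [b.mean() for b in blocks]  # computed in parallel
    return float(np.mean(block_means))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)
# With equal block sizes, averaging the block means reproduces the
# full-sample mean (up to floating-point error).
print(abs(distributed_mean(x, k=8) - x.mean()))
```

Only the k block means travel to the master node, which is why this scheme needs no communication between workers.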
15-17 Example: how to estimate the mean?
P. J. Huber (1964): "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one?"
Going back to our question: what if $X, X_1,\dots,X_N$ are i.i.d. from $\Pi$ with $\mathbb E X = \mu$ and $\mathrm{Var}(X) = \sigma^2$?
Problem: construct $\mathrm{CI}(\alpha)$ for $\mu$ with coverage probability $1-\alpha$ such that for any $\alpha$
$$\mathrm{length}(\mathrm{CI}(\alpha)) \le (\text{Absolute constant})\cdot\mathrm{length}(\mathrm{CI}_{\mathrm{norm}}(\alpha)).$$
No additional assumptions on $\Pi$ are imposed.
Remark: the guarantee for the sample mean $\hat\mu_N = \frac{1}{N}\sum_{j=1}^N X_j$ is unsatisfactory, as Chebyshev's inequality only gives
$$\Pr\left(|\hat\mu_N - \mu| \ge \sigma\sqrt{\frac{1/\alpha}{N}}\right) \le \alpha.$$
Does a solution exist?
18-21 Example: how to estimate the mean?
Answer: yes!
Construction [A. Nemirovski, D. Yudin '83; N. Alon, Y. Matias, M. Szegedy '96; R. Oliveira, M. Lerasle '11]: split the sample into $k = \lfloor\log(1/\alpha)\rfloor + 1$ groups $G_1,\dots,G_k$ of size $N/k$ each, compute the group means
$$\bar\mu_j := \frac{1}{|G_j|}\sum_{i\in G_j} X_i,\quad j=1,\dots,k,$$
and take the median-of-means estimator $\hat\mu_{(k)} := \mathrm{median}(\bar\mu_1,\dots,\bar\mu_k)$.
Claim:
$$\Pr\left(|\hat\mu_{(k)} - \mu| \ge 6.5\,\sigma\sqrt{\frac{\log(1/\alpha)}{N}}\right) \le \alpha.$$
Then take
$$\mathrm{CI}(\alpha) = \left[\hat\mu_{(k)} - 6.5\,\sigma\sqrt{\frac{\log(1/\alpha)}{N}},\ \hat\mu_{(k)} + 6.5\,\sigma\sqrt{\frac{\log(1/\alpha)}{N}}\right].$$
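The construction above fits in a few lines (a hedged sketch: the heavy-tailed test distribution and sample size are my choices; the constant 6.5 is taken from the claim on the slide):

```python
import numpy as np

def median_of_means(x, alpha):
    """Median-of-means with k = floor(log(1/alpha)) + 1 groups."""
    k = int(np.log(1.0 / alpha)) + 1
    block_means = [b.mean() for b in np.array_split(x, k)]
    return float(np.median(block_means)), k

def mom_ci(x, alpha, sigma):
    """Confidence interval using the slide's constant 6.5."""
    est, _ = median_of_means(x, alpha)
    half = 6.5 * sigma * np.sqrt(np.log(1.0 / alpha) / len(x))
    return est - half, est + half

rng = np.random.default_rng(1)
# Heavy-tailed sample: Student's t with 3 degrees of freedom, mean 0.
x = rng.standard_t(df=3, size=100_000)
lo, hi = mom_ci(x, alpha=0.01, sigma=np.sqrt(3.0))  # Var = 3/(3-2) = 3
print(lo < 0.0 < hi)
```

Note that only $\mathrm{Var}(X)$ enters the interval width: no higher moments of the heavy-tailed $\Pi$ are needed, which is exactly the point of the construction.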
22-26 With $k = \lfloor\log(1/\alpha)\rfloor + 1$,
$$\Pr\left(|\hat\mu_{(k)} - \mu| \ge 6.5\,\sigma\sqrt{\frac{\log(1/\alpha)}{N}}\right) \le \alpha.$$
We would like $k$ to be large, for example $k \asymp \sqrt N$; in this case, $|\hat\mu_{(k)} - \mu| \lesssim N^{-1/4}$ with probability $\ge 1 - e^{-\sqrt N}$.
If we would like the confidence to be 95%, then $k = 5$.
Is the problem with the construction, or are the existing bounds suboptimal?
27 Simulation results. [Figure: median error of $\hat\mu_{(k)}$ as a function of $\log_N k$, for sample sizes $N = 2^{16}$, $N = 2^{18}$, and one further value of $N$.]
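The qualitative experiment behind such a plot can be reproduced with a short Monte Carlo sketch (illustrative only: the t-distribution, sample size, grid of $k$, and repetition count are my assumptions, not the talk's exact setup):

```python
import numpy as np

def mom(x, k):
    """Median-of-means with k equal-size blocks."""
    return float(np.median([b.mean() for b in np.array_split(x, k)]))

rng = np.random.default_rng(2)
N, reps = 2**16, 100
errors = {}
for k in (1, 4, 16, 64, 256):
    # Median absolute error over independent repetitions; true mean is 0.
    errs = [abs(mom(rng.standard_t(df=3, size=N), k)) for _ in range(reps)]
    errors[k] = float(np.median(errs))
print(errors)
```

Plotting `errors` against $\log_N k$ gives the kind of error-versus-$k$ curve shown on the slide.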
28 Under additional mild assumptions, existing results can be improved.
29-32 A parametric family $\{P_\theta,\ \theta\in\mathbb R\}$; $X_1,\dots,X_N$ i.i.d. from $P_{\theta_*}$. Split the sample into $k$ groups of size $m = N/k$, compute the group-wise estimators
$$\tilde\theta_1 = \tilde\theta_1(X_1,\dots,X_m),\ \dots,\ \tilde\theta_k = \tilde\theta_k(X_{N-m+1},\dots,X_N),$$
and set $\tilde\theta_{(k)} = \mathrm{median}(\tilde\theta_1,\dots,\tilde\theta_k)$.
Assumption: there exists a sequence $\{\sigma_n\}_{n\ge1}$ such that
$$g(n) := \sup_{t\in\mathbb R}\left|P\left(\frac{\tilde\theta_1 - \theta_*}{\sigma_n} \le t\right) - \Phi(t)\right| \to 0 \text{ as } n\to\infty.$$
33-36 Theorem (M., Strawn): for all $s\le k$,
$$|\tilde\theta_{(k)} - \theta_*| \le 3\sigma_n\left(g(n) + \sqrt{\frac{s}{k}}\right)$$
with probability $\ge 1 - 4e^{-2s}$.
Example: $\theta_* = \mathbb E X$, $\tilde\theta_j$ is the sample mean over the subgroup $G_j$, and $\sigma_n = \sigma/\sqrt n$ (CLT). Assume that $\mathbb E|X - \mathbb E X|^3 < \infty$. Then, by the Berry-Esseen theorem,
$$g(n) \le 0.5\,\frac{\mathbb E|X - \mathbb E X|^3}{\sigma^3\sqrt n},$$
and therefore
$$|\tilde\theta_{(k)} - \theta_*| \le 3\sigma\left(0.5\,\frac{\mathbb E|X - \theta_*|^3}{\sigma^3}\,\frac{k}{N} + \sqrt{\frac{s}{N}}\right)$$
with probability $\ge 1 - 4e^{-2s}$. This implies optimal rates whenever $k \lesssim \sqrt N$.
37-40 Example: Distributed Maximum Likelihood Estimation
$X_1,\dots,X_N \sim P_\theta$ with density $\frac{dP_\theta}{dx} = p_\theta(x)$;
regularity (smoothness) assumptions for $\{p_\theta,\ \theta\in\Theta\subseteq\mathbb R\}$ are satisfied.
Then, with $\tilde\theta_j$ the MLE over the $j$-th subgroup, for all $s\le k$,
$$|\tilde\theta_{(k)} - \theta_*| \le \frac{\mathrm{Const}}{\sqrt{I(\theta_*)}}\left(\frac{k}{N} + \sqrt{\frac{s}{N}}\right)$$
with probability $\ge 1 - e^{-s}$.
Asymptotic normality of the MLE has been treated recently by G. Reinert and A. Anastasiou (2014) and I. Pinelis (2016).
41 Proof of the theorem:
42-44 Connections to U-quantiles
The previously described estimators depend on the specific partition of the data. To avoid such dependence, consider
$$\bar\theta_{(k)} = \mathrm{med}\left(\tilde\theta_J,\ J\in A_N^{(n)}\right),\quad A_N^{(n)} := \{J\subseteq\{1,\dots,N\}:\ \mathrm{Card}(J) = n := \lfloor N/k\rfloor\},$$
where $\tilde\theta_J := \tilde\theta(X_j,\ j\in J)$ is an estimator of $\theta_*$ based on $\{X_j,\ j\in J\}$.
Guarantees for $\bar\theta_{(k)}$ are at least as good as for $\tilde\theta_{(k)}$:
Theorem (M., Strawn): for all $s\le k$,
$$|\bar\theta_{(k)} - \theta_*| \le 3\sigma_n\left(g(n) + \sqrt{\frac{s}{k}}\right)$$
with probability $\ge 1 - 4e^{-2s}$.
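For small samples the U-quantile version can be computed exactly by enumerating all subsets (a toy sketch; brute-force enumeration is feasible only for tiny $N$, since $|A_N^{(n)}| = \binom{N}{n}$):

```python
import numpy as np
from itertools import combinations

def u_mom(x, n):
    """Median of the means over ALL subsets J of size n, which removes
    the dependence on one particular partition of the data."""
    means = [np.mean([x[i] for i in J])
             for J in combinations(range(len(x)), n)]
    return float(np.median(means))

x = [1.0, 2.0, 3.0, 4.0, 100.0]   # one gross outlier
print(u_mom(x, n=2))              # median over C(5,2) = 10 pairwise means -> 3.25
```

Even though 4 of the 10 pairwise means are dragged far away by the outlier, the median over all of them stays near the bulk of the data.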
45-48 Extension to higher dimensions
Assume that $X\in\mathbb R^d$, $\mathbb E X = \theta_*$, $\mathbb E(X-\mathbb E X)(X-\mathbb E X)^T = \Sigma$.
"Naive approach": estimate each coordinate of $\theta_*$ separately, i.e. take the coordinatewise median
$$\hat x = \operatorname*{argmin}_{y\in\mathbb R^d}\sum_{j=1}^{k}\|y - x_j\|_1.$$
It follows from the previous results and the union bound that
$$\|\hat\theta_{(k)} - \theta_*\|_2 \le 3\sqrt{\mathrm{tr}\,\Sigma}\left(\max_{j=1,\dots,d}\frac{\mathbb E|X^{(j)} - \theta_*^{(j)}|^3}{\Sigma_{j,j}^{3/2}}\,\frac{k}{N} + \sqrt{\frac{s}{N}}\right)$$
with probability $\ge 1 - 4de^{-2s}$.
This estimator is not invariant with respect to orthogonal transformations.
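Since the $\ell_1$-minimization above decouples across coordinates, the naive approach is a single NumPy call (toy block means with one corrupted node, assumed for illustration):

```python
import numpy as np

# Block means reported by 4 nodes in a 2-dimensional problem;
# the last node is corrupted by outliers.
block_means = np.array([[ 0.9,   2.1],
                        [ 1.1,   1.9],
                        [ 1.0,   2.0],
                        [50.0, -40.0]])

# Minimizing sum_j ||y - x_j||_1 decouples across coordinates,
# so the solution is the coordinatewise median.
est = np.median(block_means, axis=0)
print(est)   # [1.05 1.95] -- unaffected by the corrupted node
```

Rotating the data, however, changes the coordinatewise median in a non-equivariant way, which is the lack of orthogonal invariance noted above.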
49-51 Extension to higher dimensions
Definition: the geometric median of $x_1,\dots,x_k\in\mathbb R^d$ is
$$\hat x = \mathrm{med}(x_1,\dots,x_k) := \operatorname*{argmin}_{y\in\mathbb R^d}\sum_{j=1}^{k}\|y - x_j\|_2.$$
Remarks:
1. $\hat x \in \mathrm{conv}(x_1,\dots,x_k)$, the convex hull of the points.
2. $\hat x$ can be numerically approximated using Weiszfeld's algorithm.
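Weiszfeld's algorithm mentioned in the remark is an iteratively re-weighted averaging scheme; a minimal sketch (the step count, tolerance, and toy data are my choices):

```python
import numpy as np

def geometric_median(points, n_iter=500, eps=1e-9):
    """Weiszfeld's algorithm: repeatedly replace the iterate with a
    weighted average of the points, each point weighted by the inverse
    of its distance to the current iterate."""
    pts = np.asarray(points, dtype=float)
    y = pts.mean(axis=0)                  # start from the sample mean
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(pts - y, axis=1), eps)
        w = 1.0 / d                       # inverse-distance weights
        y_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

pts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
gm = geometric_median(pts)
# The geometric median stays near the cluster despite the outlier.
print(gm)
```

Consistent with Remark 1, the output lies in the convex hull of the inputs, and each iteration does not increase the sum of $\ell_2$ distances, so the result is at least as good as the starting mean.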
52-54 Extension to higher dimensions
Assumption: there exist a sequence $\{\sigma_n\}_{n\in\mathbb N}\subset\mathbb R_+$ and a positive-definite matrix $\Sigma$ with $\|\Sigma\|\le 1$ such that
$$g_d(n) := \sup_{S\text{ cone}}\left|P\left(\frac{\tilde\theta_1 - \theta_*}{\sigma_n}\in S\right) - \Phi_\Sigma(S)\right| \to 0 \text{ as } n\to\infty.$$
Theorem (M., Strawn): for all $1\le s\le k$,
$$\|\tilde\theta_{(k)} - \theta_*\|_2 \le \sigma_n\left(\mathrm{Const}_1(d)\sqrt{\frac{s}{k}} + \frac{\mathrm{Const}_2(d)}{\sqrt k} + g_d(n)\right)$$
with probability $\ge 1 - e^{-s}$, where
$$\mathrm{Const}_1(d) = 6\log\left(4e^{5/2}(d+4)\right)\left(\sqrt d + 2\sqrt{(d-1)\ln 4}\right),\quad \mathrm{Const}_2(d) = \sqrt d + 2\sqrt{(d-1)\ln 4}.$$
55 Example: for the mean estimation problem, the bound becomes
$$\|\hat\theta_{(k)} - \theta_*\|_2 \le 32.4\,\|\Sigma^{1/2}\|\,\mathrm{cond}(\Sigma^{1/2})\,\frac{C_1(d) + C_2(d)\sqrt s}{\sqrt{4N}} + 400\,d^{1/4}\,\frac{\mathbb E\|\Sigma^{-1/2}(X - \theta_*)\|_2^3}{\sqrt n}.$$
56-60 Further questions
What if asymptotic normality does not hold? What is the correct way to measure symmetry?
Is it possible to obtain bounds with optimal dependence on the dimension?
Applications to robust optimization techniques, such as variants of gradient descent methods.
Empirical risk minimization based on the median-of-means?
Extensions to Bayesian statistics?
61 Thank you for your attention!
More informationNonconcave Penalized Likelihood with A Diverging Number of Parameters
Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized
More informationStat 710: Mathematical Statistics Lecture 31
Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2012 F, FRI Uppsala University, Information Technology 30 Januari 2012 SI-2012 K. Pelckmans
More informationEmpirical Likelihood
Empirical Likelihood Patrick Breheny September 20 Patrick Breheny STA 621: Nonparametric Statistics 1/15 Introduction Empirical likelihood We will discuss one final approach to constructing confidence
More informationProceedings of the 2014 Winter Simulation Conference A. Tolk, S. Y. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds.
Proceedings of the 2014 Winter Simulation Conference A. Tolk, S. Y. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds. BOOTSTRAP RANKING & SELECTION REVISITED Soonhui Lee School of Business
More informationSection 8.2. Asymptotic normality
30 Section 8.2. Asymptotic normality We assume that X n =(X 1,...,X n ), where the X i s are i.i.d. with common density p(x; θ 0 ) P= {p(x; θ) :θ Θ}. We assume that θ 0 is identified in the sense that
More informationSystem Identification, Lecture 4
System Identification, Lecture 4 Kristiaan Pelckmans (IT/UU, 2338) Course code: 1RT880, Report code: 61800 - Spring 2016 F, FRI Uppsala University, Information Technology 13 April 2016 SI-2016 K. Pelckmans
More informationStatistical Inference
Statistical Inference Classical and Bayesian Methods Class 7 AMS-UCSC Tue 31, 2012 Winter 2012. Session 1 (Class 7) AMS-132/206 Tue 31, 2012 1 / 13 Topics Topics We will talk about... 1 Hypothesis testing
More informationSTAT 461/561- Assignments, Year 2015
STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationCommunication-Efficient Distributed Statistical Inference
Communication-Efficient Distributed Statistical Inference Michael I. Jordan, Jason D. Lee, Yun Yang arxiv:1605.07689v3 [stat.ml] 6 Nov 2016 November 8, 2016 1 Abstract We present a Communication-efficient
More informationEcon 583 Final Exam Fall 2008
Econ 583 Final Exam Fall 2008 Eric Zivot December 11, 2008 Exam is due at 9:00 am in my office on Friday, December 12. 1 Maximum Likelihood Estimation and Asymptotic Theory Let X 1,...,X n be iid random
More informationStatistical Data Analysis
DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the
More informationControlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method
Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method Christopher R. Genovese Department of Statistics Carnegie Mellon University joint work with Larry Wasserman
More informationLecture 32: Asymptotic confidence sets and likelihoods
Lecture 32: Asymptotic confidence sets and likelihoods Asymptotic criterion In some problems, especially in nonparametric problems, it is difficult to find a reasonable confidence set with a given confidence
More informationMIT Spring 2015
MIT 18.443 Dr. Kempthorne Spring 2015 MIT 18.443 1 Outline 1 MIT 18.443 2 Batches of data: single or multiple x 1, x 2,..., x n y 1, y 2,..., y m w 1, w 2,..., w l etc. Graphical displays Summary statistics:
More informationDivide-and-combine Strategies in Statistical Modeling for Massive Data
Divide-and-combine Strategies in Statistical Modeling for Massive Data Liqun Yu Washington University in St. Louis March 30, 2017 Liqun Yu (WUSTL) D&C Statistical Modeling for Massive Data March 30, 2017
More informationQuick Tour of Basic Probability Theory and Linear Algebra
Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions
More informationRobust Statistics, Revisited
Robust Statistics, Revisited Ankur Moitra (MIT) joint work with Ilias Diakonikolas, Jerry Li, Gautam Kamath, Daniel Kane and Alistair Stewart CLASSIC PARAMETER ESTIMATION Given samples from an unknown
More informationMathematical Statistics
Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics
More informationMachine learning, shrinkage estimation, and economic theory
Machine learning, shrinkage estimation, and economic theory Maximilian Kasy December 14, 2018 1 / 43 Introduction Recent years saw a boom of machine learning methods. Impressive advances in domains such
More informationDistributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College
Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic
More informationarxiv: v1 [math.st] 15 Nov 2017
Submitted to the Annals of Statistics A NEW PERSPECTIVE ON ROBUST M-ESTIMATION: FINITE SAMPLE THEORY AND APPLICATIONS TO DEPENDENCE-ADJUSTED MULTIPLE TESTING arxiv:1711.05381v1 [math.st] 15 Nov 2017 By
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationA NEW PERSPECTIVE ON ROBUST M-ESTIMATION: FINITE SAMPLE THEORY AND APPLICATIONS TO DEPENDENCE-ADJUSTED MULTIPLE TESTING
Submitted to the Annals of Statistics A NEW PERSPECTIVE ON ROBUST M-ESTIMATION: FINITE SAMPLE THEORY AND APPLICATIONS TO DEPENDENCE-ADJUSTED MULTIPLE TESTING By Wen-Xin Zhou, Koushiki Bose, Jianqing Fan,
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini April 27, 2018 1 / 80 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationf(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain
0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher
More informationRobust estimation of scale and covariance with P n and its application to precision matrix estimation
Robust estimation of scale and covariance with P n and its application to precision matrix estimation Garth Tarr, Samuel Müller and Neville Weber USYD 2013 School of Mathematics and Statistics THE UNIVERSITY
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Mixture Models, Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 2: 1 late day to hand it in now. Assignment 3: Posted,
More information