Can we do statistical inference in a non-asymptotic way?

Guang Cheng (Statistics@Purdue)
ONR Review Meeting@Duke, Oct 11

Acknowledgments: NSF, ONR and the Simons Foundation. An exploratory and ongoing work with Z. Shang and Y. Yang.

Starting from Stat 101

Given that X_1, ..., X_n iid ∼ N(θ_0, 1), we have

\[ P\Big(\theta_0 \in \Big[\bar{X} \pm \frac{1.96}{\sqrt{n}}\Big]\Big) = 95\%; \]

Without knowing the distribution of the X_i's, we obtain via the CLT

\[ P\Big(\theta_0 \in \Big[\bar{X} \pm \frac{1.96}{\sqrt{n}}\Big]\Big) \approx 95\%. \]

But the price is that the validity is only asymptotic;

What if the distribution of X is unknown and n is finite?
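
For intuition, a minimal Python sketch (our own illustration; the Exp(1) design, n, and the number of replicates are assumptions) that estimates the finite-n coverage of the nominal 95% CLT interval when the data are skewed rather than Gaussian:

import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000
theta0 = 1.0                          # mean of Exp(1), which also has variance 1
x = rng.exponential(scale=1.0, size=(reps, n))
xbar = x.mean(axis=1)
half = 1.96 / np.sqrt(n)              # CLT half-width with known unit variance
coverage = np.mean(np.abs(xbar - theta0) <= half)
print(f"empirical coverage at n={n}: {coverage:.3f}")   # typically falls short of 0.95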

A first thought: concentration inequality

Concentration inequalities quantify how a random variable X deviates around its mean or median µ. They usually take the form of two-sided bounds for the tails of X − µ, such as

\[ P(|X - \mu| > t) \le \text{something small}. \]

The simplest concentration inequality is Chebyshev's inequality;

Concentration inequalities hold for any sample size n.
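
As a worked instance (our own, using nothing beyond Chebyshev's inequality and Var(X̄) = σ²/n):

\[
P\big(|\bar{X} - \theta_0| > t\big) \le \frac{\sigma^2}{n t^2},
\qquad \frac{\sigma^2}{n t^2} = 0.05 \;\Longrightarrow\; t = \sqrt{20}\,\frac{\sigma}{\sqrt{n}} \approx \frac{4.47\,\sigma}{\sqrt{n}},
\]

a valid 95% interval at every n, but far wider than the asymptotic 1.96 σ/√n; hence the search for sharper inequalities below.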

Does the concentration inequality idea really work?

For example, Hoeffding's inequality gives

\[ P\Big(\theta_0 \in \Big[\bar{X} \pm \frac{1.27\, c\, \|X\|_{\psi_2}}{\sqrt{n}}\Big]\Big) \ge 95\% \]

for iid X_i's with sub-Gaussian tails.³ Here ‖X‖_{ψ_2} is the so-called sub-Gaussian norm, e.g., ‖X‖_{ψ_2} = 1/√(log 2) for Rademacher X;

For illustration, let us examine one special case, i.e., Bernoulli, in the class of sub-Gaussian random variables;

When X_i iid ∼ Ber(θ_0), we have for any sample size n

\[ P\big(\theta_0 \in [\bar{X} \pm 1.36/\sqrt{n}]\big) \ge 95\%, \]

which is sharp in the rate but not in the constant, in comparison with the asymptotic CI [X̄ ± 0.98/√n] (set θ_0 = 0.5).

³ Relaxable to sub-exponential tails via Bernstein's inequality.
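
The Bernoulli constant follows directly from Hoeffding's bound P(|X̄ − θ_0| > t) ≤ 2 exp(−2nt²); the computation, supplied here for completeness, is

\[
2\exp(-2nt^2) = 0.05 \;\Longleftrightarrow\; t = \sqrt{\frac{\log 40}{2n}} \approx \frac{1.36}{\sqrt{n}}.
\]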

Empirical comparison for Bernoulli

[Two plots, not reproduced in this transcription: the left compares the widths of the confidence intervals from the two methods; the right shows their coverage probabilities versus (small) sample size. Simulations were repeated 1,000 times.]

Two competing effects: conservativeness vs. asymptotics.
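
A minimal Python sketch of this comparison (our own reconstruction; θ_0 = 0.5 and 1,000 replicates follow the slides, while the sample-size grid is an assumption):

import numpy as np

rng = np.random.default_rng(1)
theta0, reps = 0.5, 1000
for n in [10, 20, 50, 100]:
    xbar = rng.binomial(n, theta0, size=reps) / n
    hoeff = 1.36 / np.sqrt(n)                       # finite-sample (Hoeffding) half-width
    wald = 1.96 * np.sqrt(xbar * (1 - xbar) / n)    # asymptotic (CLT/Wald) half-width
    cov_h = np.mean(np.abs(xbar - theta0) <= hoeff)
    cov_w = np.mean(np.abs(xbar - theta0) <= wald)
    print(f"n={n:4d}  mean width ratio={hoeff / wald.mean():.2f}  "
          f"coverage: Hoeffding {cov_h:.3f}, Wald {cov_w:.3f}")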

How far can we go beyond these simple cases?

There are some recent results on more complicated parametric models obtained via various notions of concentration inequality, such as Chernozhukov et al. (2013) and Strawn et al. (2014).

Chernozhukov et al. (2013) presented non-asymptotic results on bootstrap validity for high dimensional problems such as $n^{-1/2}\sum_{i=1}^n X_i$ for X_i ∈ R^p with p ≫ n;

Strawn et al. (2014) obtained non-asymptotic bounds on the expected concentration of a posterior (around the true parameter) in high dimensional Bayesian linear models;

As far as we are aware, these bounds were mostly derived to justify (asymptotic) statistical estimation/inference in a non-asymptotic manner, rather than used to construct finite-sample inference.

Outline

1. Smoothing spline models
2. A non-asymptotic Bahadur representation
3. Statistical application: hypothesis testing
4. Simulations

Smoothing spline models

Consider a nonparametric regression model y_i = f(x_i) + ε_i. For simplicity, we assume that x ∈ I := [0, 1], and that x and ε are independent with Eε = 0 and Var(ε) = 1;

The standard smoothing spline estimate (or, more generally, kernel ridge estimate) is obtained as

\[
\widehat{f}_n := \arg\min_{f \in H} \; \frac{1}{n}\sum_{i=1}^n (y_i - f(x_i))^2 + \underbrace{\lambda \|f\|_H^2}_{\text{roughness penalty}},
\]

where H is a reproducing kernel Hilbert space (RKHS).
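
By the representer theorem, this minimizer has the finite-dimensional form f̂_n(·) = Σ_i α_i K(x_i, ·) with α = (K + nλI)^{-1} y; a minimal Python sketch, where the Gaussian kernel stands in for the Sobolev kernel purely as an assumption:

import numpy as np

def kernel_ridge_fit(x, y, lam, bandwidth=0.1):
    """Kernel ridge regression: alpha = (K + n*lam*I)^{-1} y."""
    n = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * bandwidth**2))
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    def f_hat(xnew):
        Knew = np.exp(-((xnew[:, None] - x[None, :]) ** 2) / (2 * bandwidth**2))
        return Knew @ alpha
    return f_hat

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(size=200)
f_hat = kernel_ridge_fit(x, y, lam=1e-3)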

A non-asymptotic Bahadur representation

Starting from a naive concentration inequality

We first investigate the concentration property of T_n := ‖f̂_n − f_0‖.

Lemma 1. Under proper non-asymptotic conditions on λ > 0 and M > 0, it holds that

\[
\sup_{f_0 \in H^m(1)} P_{f_0}\big(T_n \ge d_n(M, \lambda)\big) \le 2\exp(-M), \tag{1}
\]

where $d_n(M, \lambda) = 2\lambda^{1/2} + c_K(\sqrt{2M} + 1)\,(n\lambda^{1/(2m)})^{-1/2}$ and $H^m(1) := \{f \in S^m(I) : \|f\|_H \le 1\}$.

Note that Lemma 1 holds uniformly over a unit ball, for any sample size.

A closer examination of the naive concentration

We notice that E{f̂_n} ≈ (I − P_λ)f_0 ≠ f_0! The term P_λ f_0 is the estimation bias caused by penalization;

If using T_n = ‖f̂_n − f_0‖ to test H_0 : f = f_0, we prove that its minimal separation rate is $n^{-m/(2m+1)}$, which is sub-optimal in comparison with the minimax rate of testing $n^{-2m/(4m+1)}$;

To make inference, we need to develop a second-order expansion, which we call a non-asymptotic Bahadur representation.
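
The separation-rate claim can be checked by balancing the two terms of d_n(M, λ) from Lemma 1 (a one-line computation):

\[
\lambda^{1/2} \asymp (n\lambda^{1/(2m)})^{-1/2}
\;\Longrightarrow\; \lambda \asymp n^{-2m/(2m+1)}
\;\Longrightarrow\; d_n \asymp n^{-m/(2m+1)},
\]

which is the estimation rate and is indeed slower than the minimax testing rate $n^{-2m/(4m+1)}$.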

Review on (asymptotic) Bahadur representation

Bahadur representations (Bahadur, 1966) are often useful to study the asymptotic properties of statistical estimators;

For example, He and Shao (1996) considered a general M-estimation framework, $(1/n)\sum_{i=1}^n \psi(x_i, \widehat{\theta}_n) = o(\delta_n)$ for some δ_n → 0, and obtained that, as n → ∞,

\[
\Big\| \widehat{\theta}_n - \theta_0 - \frac{1}{n}\sum_{i=1}^n D_n^{-1}\psi(x_i, \theta_0) \Big\| = O(R_n) \quad \text{almost surely},
\]

for some $R_n = o(n^{-1/2})$.
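
A degenerate but instructive special case (our own illustration): for the sample mean, take ψ(x, θ) = x − θ, for which D_n = 1 under the sign convention that makes the leading term $n^{-1}\sum_i (x_i - \theta_0)$; then

\[
\widehat{\theta}_n - \theta_0 - \frac{1}{n}\sum_{i=1}^n \psi(x_i, \theta_0)
= \bar{x} - \theta_0 - (\bar{x} - \theta_0) = 0,
\]

i.e., the remainder vanishes exactly; for genuinely nonlinear ψ the remainder R_n is nonzero but of smaller order than $n^{-1/2}$.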

Non-asymptotic Bahadur representation

Inspired by Shang and Cheng (2013), we define

\[
\widetilde{T}_n := \Big\| \widehat{f}_n - (I - P_\lambda)f_0 - \frac{1}{n}\sum_{i=1}^n \epsilon_i K_{X_i} \Big\|
\]

and study its concentration property.

Theorem 2. Under similar conditions as in Lemma 1, it holds that

\[
\sup_{f_0 \in H^m(1)} P_{f_0}\big(\widetilde{T}_n \ge \widetilde{d}_n(M, \lambda)\big) \le 2\exp(-M),
\]

where $\widetilde{d}_n(M, \lambda) = c_K^2 \sqrt{M}\, n^{-1/2} \lambda^{-1/(2m)} A(\lambda)\, d_n(M, \lambda)$.

Statistical application: hypothesis testing

Nonparametric hypothesis of interest

Consider the following hypothesis testing problem:

\[ H_0: f = f_0 \quad \text{vs} \quad H_1: f \ne f_0, \]

e.g., f_0 = 0.

Test statistic

Clearly, $\widetilde{T}_n = \| \widehat{f}_n - (I - P_\lambda)f_0 - n^{-1}\sum_{i=1}^n \epsilon_i K_{X_i} \|$ cannot be used as a test statistic, since the ε_i's are not observable;

Rather, we propose to use T_n as the test statistic:

\[
T_n = \big\| \widehat{f}_n - (I - P_\lambda) f_0 \big\|^2 - E\Big\| \frac{1}{n}\sum_{i=1}^n \epsilon_i K_{X_i} \Big\|^2
    = \big\| \widehat{f}_n - (I - P_\lambda) f_0 \big\|^2 - \frac{1}{n}\sum_{\nu \ge 1} \frac{1}{1 + \lambda/\rho_\nu};
\]

In practice, replace (I − P_λ)f_0 by a noiseless version of f̂_n:

\[
\widehat{f}_n^{NL} = \arg\min_{f \in S^m(I)} \; \frac{1}{n}\sum_{i=1}^n (f_0(X_i) - f(X_i))^2 + \lambda \|f\|_H^2.
\]
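
The calibration term $n^{-1}\sum_\nu (1 + \lambda/\rho_\nu)^{-1}$ can be approximated from the empirical eigenvalues of the reproducing kernel matrix (as the slides later suggest for the constants); a minimal sketch, where the Gaussian kernel and the use of eigenvalues of K/n as stand-ins for the ρ_ν are our assumptions:

import numpy as np

def trace_term(x, lam, bandwidth=0.1):
    """Approximate (1/n) * sum_nu 1/(1 + lam/rho_nu), with rho_nu
    replaced by the empirical eigenvalues of the kernel matrix K/n."""
    n = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * bandwidth**2))
    rho = np.linalg.eigvalsh(K / n)
    rho = rho[rho > 1e-12]            # discard numerically zero eigenvalues
    return np.sum(1.0 / (1.0 + lam / rho)) / n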

Exact Type I and II error control

Based on the non-asymptotic Bahadur representation, we can control the Type I and II errors for any finite sample size as follows.

Theorem 3. Under proper non-asymptotic conditions on λ, α and β, we have

\[
\text{Type I error:}\quad P_{f_0}\big(T_n \ge \bar{d}_n(\log(15/\alpha), \lambda)\big) \le \alpha,
\]
\[
\text{Type II error:}\quad \sup_{f:\, f - f_0 \in H^m_{\alpha,\beta}(1)} P_f\big(T_n \le \bar{d}_n(\log(15/\alpha), \lambda)\big) \le \beta,
\]

for any α, β ∈ (0, 1).

The cut-off value $\bar{d}_n(M, \lambda) \approx 4\sqrt{M}\,\rho_K/(n\lambda^{1/(4m)})$. The separation set $H^m_{\alpha,\beta}(1) := \{g \in H^m(1) : \|g\| \ge \rho_n(\log(15/\alpha), \log(30/\beta), \lambda)\}$, where $\rho_n(M, L, \lambda)^2 \approx \zeta_K \lambda + 4\rho_K\sqrt{M}/(n\lambda^{1/(4m)})$.

Asymptotic minimax optimality

An important implication of Theorem 3 is that T_{n,λ} is asymptotically minimax optimal;

To see this, we minimize the separation function ρ_n(log(15/α), log(30/β), λ) over λ and obtain the minimal separation rate $n^{-2m/(4m+1)}$ when λ is chosen as

\[
\lambda^* := \left( \frac{(4\rho_K)^2 \log(15/\alpha)}{\zeta_K^2} \right)^{2m/(4m+1)} n^{-4m/(4m+1)} \asymp n^{-4m/(4m+1)}.
\]

Following similar derivations, we prove that the test statistic based on the naive concentration, i.e., T_n = ‖f̂_n − f_0‖, is actually sub-optimal. So our intuition of not using it was correct.
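
The form of λ* follows from a single calculus step applied to the separation function of Theorem 3 (constants carried loosely, with M = log(15/α)):

\[
\frac{d}{d\lambda}\left[ \zeta_K \lambda + \frac{4\rho_K\sqrt{M}}{n\,\lambda^{1/(4m)}} \right] = 0
\;\Longrightarrow\; \lambda^{(4m+1)/(4m)} \asymp \frac{\rho_K\sqrt{M}}{\zeta_K\, n}
\;\Longrightarrow\; \lambda^* \asymp \left(\frac{\rho_K\sqrt{M}}{\zeta_K}\right)^{4m/(4m+1)} n^{-4m/(4m+1)},
\]

and plugging λ* back in gives $\rho_n(\lambda^*) \asymp n^{-2m/(4m+1)}$, the minimax rate of testing.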

Extensions

Composite hypotheses, e.g., H_0 : f ∈ F_0 vs H_1 : f ∉ F_0, where F_0 = {f : f is linear on I};

Kernel ridge regression with different decaying rates of eigenvalues;

Non-Gaussian regression with a smooth log-concave density, such as logistic regression.

Simulations

Simulations

Consider the following hypothesis:

\[ H_0: f = f_0 \quad \text{vs} \quad H_1: f \ne f_0, \]

where f_0(x) = 5(x² − x);

Data were generated as Y_i = f_c(X_i) + ε_i, with f_c(x) = (1/2) c x² + f_0(x);

Set ε_i iid ∼ N(0, 1) and X_i iid ∼ Unif(0, 1);

Set the Type I and II errors as α = β = 0.05.

Remark: cutoff value

Recall that the cutoff value d̄_n(log(15/α), λ) is provided in Theorem 3. However, we found it to be quite conservative, due to the use of some loose concentration inequalities;

Fortunately, we can simulate the cutoff instead. Conditioning on X, we simulate a number of synthetic datasets {Y^{(k)}}_{k=1}^N from the null model Y_i^{(k)} = f_0(X_i) + N(0, 1), each of which yields a new test statistic T_n^{(k)}. The cutoff value is then set as the (1 − α)-th sample quantile of {T_n^{(k)}}_{k=1}^N.
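
A minimal sketch of this Monte Carlo calibration (compute_Tn stands for a routine evaluating the test statistic and is hypothetical; f0 is the null curve from the simulation setup):

import numpy as np

def mc_cutoff(x, f0, compute_Tn, alpha=0.05, N=1000, rng=None):
    """Simulate the null distribution of T_n conditionally on X and
    return its (1 - alpha)-th sample quantile as the cutoff."""
    if rng is None:
        rng = np.random.default_rng(3)
    stats = []
    for _ in range(N):
        y_null = f0(x) + rng.normal(size=len(x))   # Y^(k) drawn from the null model
        stats.append(compute_Tn(x, y_null))
    return np.quantile(stats, 1 - alpha)

f0 = lambda x: 5 * (x**2 - x)                      # null curve used in the simulations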

Remark: the choice of λ

Although the theoretical cutoff is conservative, our theoretical analysis is still useful for choosing λ;

In simulations, we choose λ by numerically minimizing the separation function ρ_n(log(15/α), log(30/β), λ) over λ. The resulting λ is denoted λ_FS;

This non-asymptotic approach is computationally much cheaper than conventional generalized cross-validation;

Note that all constants in ρ_n(M, L, λ) and d̄_n(M, λ), such as ρ_K, ζ_K and c_K, depend only on λ and the eigenvalues ρ_ν, which can be well approximated by the empirical eigenvalues of the reproducing kernel matrix.
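
A sketch of the λ_FS selection (with m = 2 as a default, ρ_K and ζ_K supplied externally, e.g., from the empirical-eigenvalue approximation above, and the grid as an assumption):

import numpy as np

def lambda_fs(n, rho_K, zeta_K, m=2, alpha=0.05):
    """Minimize the separation function
    rho_n(lam)^2 ~ zeta_K*lam + 4*rho_K*sqrt(log(15/alpha))/(n*lam^(1/(4m)))
    over a grid of candidate lambdas."""
    grid = np.logspace(-8, -1, 200)
    M = np.log(15 / alpha)
    sep = zeta_K * grid + 4 * rho_K * np.sqrt(M) / (n * grid ** (1 / (4 * m)))
    return grid[np.argmin(sep)]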

Empirical comparison with an asymptotically valid test

For the simple hypothesis, we compare four testing procedures:

(S1) The proposed T_n with the smoothing parameter λ_FS;
(S2) The asymptotically valid penalized likelihood ratio test (Shang and Cheng, 2013), denoted PLRT(f_0), with the same λ_FS;
(S3) The proposed T_n with λ selected by GCV, denoted λ_GCV;
(S4) The PLRT statistic PLRT(f_0) with the same λ_GCV.

Note that the cut-off values for the PLRT in S2 and S4 are obtained from Monte Carlo simulation and from the null limit distribution given in Shang and Cheng (2013), respectively.

Simulation results

[Table 1; the numerical entries are not recoverable from this transcription. Columns: n, c, λ_FS^{1/4}, RP_FS, RP_PLRT, λ̄_GCV^{1/4}, RP_GCV, RP*_GCV, with standard errors in parentheses.]

Table 1: Simulation results for simple hypothesis testing. λ̄_GCV is an average value over 500 replicates (and varies with c). RP_FS, RP_PLRT, RP_GCV and RP*_GCV are the average rejection proportions, over 500 replicates, of T_n with λ_FS, the PLRT with λ_FS, T_n with λ_GCV and the PLRT with λ_GCV, respectively.

Empirical observations

Overall, all four procedures have comparable Type I errors (i.e., c = 0) for any sample size;

As for power performance, we note that

(i) the tests using λ_FS are always more powerful than those using λ_GCV. This justifies the finite-sample advantage of the non-asymptotic formula in selecting λ;

(ii) T_n is always more powerful than the PLRT given the same choice of λ. In other words, S1 is always the most powerful one. This supports the need for removing the estimation bias in nonparametric testing;

The third observation is that as c increases, λ_GCV continues decreasing and becomes closer to λ_FS, but never reaches λ_FS. This is consistent with their different asymptotic orders, i.e., $\lambda_{FS} \asymp n^{-2m/(2m+1/2)}$ and $\lambda_{GCV} \asymp n^{-2m/(2m+1)}$.

Discussions

The non-asymptotic Bahadur representation seems a promising direction to pursue for finite-sample inference;

The practical usefulness of finite-sample inference procedures relies on computably sharp concentration inequalities. Can this be achieved through data re-sampling or MCMC?

Does there exist something like non-asymptotic Fisher information?

Can the non-asymptotic approach be extended to the Bayesian domain, e.g., to choose hyper-parameters in hierarchical priors?

Questions?

Thanks to ONR for support.

Some necessary notation

In smoothing spline models, we set H to be an m-th order Sobolev space, denoted S^m(I), and $\|f\|_H^2 := \int_I \{f^{(m)}(x)\}^2 dx$;

Define ⟨f, g⟩ = E{f(X)g(X)} + λ ∫_I f^{(m)}(x) g^{(m)}(x) dx. Endowed with ⟨·, ·⟩, S^m(I) is an RKHS;

Let K(x_1, x_2) be the reproducing kernel function satisfying ⟨K_x, f⟩ = f(x), with K_x(·) = K(x, ·). The underlying eigen-system is denoted $\{\rho_\nu, \varphi_\nu(\cdot)\}_{\nu=1}^\infty$, where $\rho_\nu \asymp \nu^{-2m}$;

For later use, define

\[
c_K = \lambda^{1/(4m)} \sup_{x \in I} K^{1/2}(x, x), \qquad
\rho_K = \lambda^{1/(2m)}\, E[K^2(X_1, X_2)], \quad X_1, X_2 \overset{iid}{\sim} X.
\]
