Bayesian Aggregation for Extraordinarily Large Dataset

Bayesian Aggregation for Extraordinarily Large Dataset
Guang Cheng, Department of Statistics, Purdue University
Department Seminar, Statistics@LSE, May 19
A joint work with Zuofeng Shang. Acknowledge NSF, ONR and Simons Foundation.

Big Data: large and complex
"There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days." Eric Schmidt, Google CEO.
(1 EB = 10^18 bytes and 1 ZB = 10^21 bytes.)

Big Data: large and complex [figure]

Bayesian Aggregation
This generic aggregation procedure applies to both finite-dimensional and infinite-dimensional parameters.
[Diagram] Big Data (N) → Super machine → R_oracle(α). Divide: Subset 1 (n_1) → Machine 1 → R_1(α); Subset 2 (n_2) → Machine 2 → R_2(α); ...; Subset s (n_s) → Machine s → R_s(α). Aggregate → R(α).
R_oracle(α): (1 − α) oracle credible region constructed from the entire data (computationally prohibitive in practice, though); R_j(α): (1 − α) credible region constructed from the j-th subset.
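A minimal sketch, in Python, of the divide-and-aggregate workflow in the diagram above. The routine names subset_credible_region and divide_and_conquer are hypothetical placeholders (not from the talk), and the per-subset summary and combination rule shown here are stand-ins; the talk's actual aggregation rule appears on the later slides.

```python
import numpy as np

def subset_credible_region(data_j, alpha):
    """Hypothetical per-machine routine: returns the (center, radius) of a
    (1 - alpha) credible region computed from one subset (e.g., via MCMC)."""
    center = data_j.mean(axis=0)                    # placeholder posterior summary
    radius = data_j.std() / np.sqrt(len(data_j))    # placeholder radius
    return center, radius

def divide_and_conquer(data, s, alpha=0.05):
    subsets = np.array_split(data, s)               # divide the N observations into s subsets
    regions = [subset_credible_region(d, alpha) for d in subsets]  # one machine per subset
    centers = np.array([c for c, _ in regions])
    radii = np.array([r for _, r in regions])
    # Placeholder aggregation; the talk's actual rule is spelled out later.
    return centers.mean(axis=0), np.sqrt(np.mean(radii ** 2))

rng = np.random.default_rng(0)
center, radius = divide_and_conquer(rng.normal(size=(10000, 1)), s=10)
print(center, radius)
```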

A Series of Theoretical Questions...
How to define an aggregation rule s.t. R(α) covers (1 − α) posterior mass, with the same radius as R_oracle(α)?
How to construct a prior s.t. R(α) covers the true parameter (generating the data) with probability (1 − α)?
How fast can we allow s to diverge ("splitotics" theory)?
The above tasks are particularly challenging when the parameter in consideration is infinite-dimensional, which is the focus of our talk today.

Literature Review
In the Bayesian community, the existing statistical studies mostly focus on computational or methodological aspects of MCMC-based distributed methods;
Nonetheless, not much effort has been devoted to theoretically understanding scalable Bayesian procedures, especially in a general nonparametric context;
One particular reason is the failure of the Bernstein–von Mises theorem in the nonparametric setting, found by Cox (1993) and Freedman (1999).

Outline

What is the Bernstein–von Mises (BvM) Theorem?
The BvM theorem (named after two mathematicians, S. Bernstein and R. von Mises) characterizes the asymptotic shape of the posterior distribution: d(Π(· | D_n), P_0(·)) → 0 as n → ∞, where Π(· | D_n) represents a posterior measure based on a sample D_n of size n, P_0(·) is a limiting probability measure, and d denotes a distance measure;
For example, in parametric models the BvM theorem says
  sup_{B ∈ B} | Π(B | D_n) − N(θ̃_n, (n I_{θ0})^{−1})(B) | = o_{P^n_{θ0}}(1),
where B is the Borel σ-algebra on R^d.
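As a hedged numerical illustration (mine, not from the talk), the parametric BvM effect can be watched in a Beta–Bernoulli model: the exact posterior and its normal approximation N(θ̂_n, (n I_{θ̂_n})^{−1}) get closer as n grows. The sup-distance over intervals (a Kolmogorov-type distance) is used below as a simple proxy for the supremum over Borel sets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0 = 0.3

for n in (50, 500, 5000):
    y = rng.binomial(1, theta0, size=n)
    k = y.sum()
    post = stats.beta(1 + k, 1 + n - k)            # exact posterior under a Beta(1,1) prior
    theta_hat = k / n
    fisher = 1.0 / (theta_hat * (1 - theta_hat))   # per-observation Fisher information at the MLE
    approx = stats.norm(theta_hat, np.sqrt(1.0 / (n * fisher)))
    grid = np.linspace(1e-4, 1 - 1e-4, 2000)
    dist = np.max(np.abs(post.cdf(grid) - approx.cdf(grid)))
    print(f"n={n:5d}  sup-CDF distance ~ {dist:.4f}")
```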

A Graphical Illustration
More importantly, the BvM theorem implies the frequentist validity of Bayesian credible sets, called the BvM phenomenon:
  P^n_{θ0}(θ0 ∈ (1 − α) credible set) → 1 − α.
[Diagram] Π(· | D_n) converges to P_0(·) (BvM Theorem); MCMC produces the (1 − α) credible set, whose frequentist coverage is the BvM phenomenon.

Nonparametric BvM: a negative example
Consider the Gaussian sequence model: Y_i = θ_{0i} + (1/√n) ε_i, i = 1, 2, ..., where ε_i ~iid N(0, 1). The true mean sequence {θ_{0i}}_{i=1}^∞ is square-summable, i.e., Σ_{i=1}^∞ θ_{0i}² < ∞;
Assign a (very innocent) Gaussian prior Π_0: θ_i ~ N(0, i^{−2p}) for some p > 1/2;
Freedman (1999) demonstrated the failure of BvM:
  P^n_{θ0}(θ0 ∈ (1 − α) credible set) → 0.
The credible set is based on the ℓ²-norm.
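The failure can be probed numerically. The sketch below is hedged: it truncates the sequence at K coordinates and picks a particular square-summable θ_0 (both choices are mine, not in the slides). In this conjugate model the posterior is Gaussian coordinate-wise, the ℓ²-ball radius is obtained as a Monte Carlo posterior quantile, and the frequentist coverage of the credible ball is then estimated over repeated data sets.

```python
import numpy as np

rng = np.random.default_rng(2)
K, n, p, alpha = 2000, 10000, 1.0, 0.05
i = np.arange(1, K + 1)
theta0 = i ** -1.0                       # assumed square-summable truth (sum of 1/i^2 is finite)
prior_var = i ** (-2 * p)                # prior: theta_i ~ N(0, i^{-2p})

def coverage(reps=200, mc=400):
    hits = 0
    for _ in range(reps):
        y = theta0 + rng.normal(size=K) / np.sqrt(n)     # Y_i = theta_{0i} + eps_i / sqrt(n)
        post_var = 1.0 / (1.0 / prior_var + n)           # conjugate Gaussian posterior variance
        post_mean = post_var * n * y                     # conjugate Gaussian posterior mean
        # Monte Carlo (1 - alpha) quantile of ||theta - posterior mean||_2 under the posterior
        draws = np.sqrt(post_var) * rng.normal(size=(mc, K))
        r = np.quantile(np.linalg.norm(draws, axis=1), 1 - alpha)
        hits += np.linalg.norm(post_mean - theta0) <= r
    return hits / reps

print("estimated frequentist coverage of the l2 credible ball:", coverage())
```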

A Solution: Tuning Prior
The power of smoothing splines (Wahba, 1990)! We will show that a nonparametric BvM theorem can be rescued under a new class of Gaussian process (GP) priors motivated by smoothing splines, named "tuning priors";
Take the Gaussian regression model as an example (our nonparametric BvM results hold in a general exponential family; no conjugacy is needed):
  Y_i = f_0(X_i) + ε_i, i = 1, 2, ..., n,
where ε_i ~iid N(0, 1) and f ∈ H^m(0, 1), the m-th order Sobolev space. Denote its log-likelihood function as
  ℓ_n(f) = −Σ_{i=1}^n (Y_i − f(X_i))² / 2.

Tuning Prior: A General Framework
Assume that f follows a probability measure Π_λ;
Specify Π_λ through its Radon–Nikodym derivative w.r.t. a base measure Π (also on H^m(0, 1)) as follows:
  (dΠ_λ / dΠ)(f) ∝ exp(−(nλ/2) J(f)),   (1.1)
where J(f) is a type of roughness penalty used in the smoothing spline literature.

Tuning Prior: Duality
Based on (1.1), we have the posterior as
  dP(f | D_n) := exp(ℓ_n(f)) dΠ_λ(f) / ∫_{H^m(0,1)} exp(ℓ_n(f)) dΠ_λ(f)
              = exp(ℓ_{n,λ}(f)) dΠ(f) / ∫_{H^m(0,1)} exp(ℓ_{n,λ}(f)) dΠ(f),
where ℓ_{n,λ}(f) = ℓ_n(f) − (nλ/2) J(f);
Smoothing spline estimate: f̂_{n,λ} := argmax_{f ∈ H^m(0,1)} ℓ_{n,λ}(f);
The name "tuning prior" now makes sense. So, we can employ GCV to select a proper tuning prior (and we did!);
More importantly, we are able to borrow recent advances in smoothing spline inference theory (Shang and C., 2013, AoS) to build a foundation for nonparametric BvM.
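A hedged sketch of GCV-based selection of λ for a penalized fit of this kind. It uses a truncated cosine basis with ρ_ν = (πν)^{2m} as an assumed stand-in for the smoothing-spline eigensystem (not the construction of the talk); the penalized estimator is then a linear smoother Y ↦ A_λ Y, and GCV(λ) = n‖(I − A_λ)Y‖² / (n − tr A_λ)² is minimized over a grid.

```python
import numpy as np

def cosine_basis(x, K):
    # phi_0 = 1, phi_k = sqrt(2) cos(pi k x): an assumed stand-in basis on [0, 1]
    cols = [np.ones_like(x)] + [np.sqrt(2) * np.cos(np.pi * k * x) for k in range(1, K)]
    return np.column_stack(cols)

def gcv_select(x, y, m=2, K=50, grid=np.logspace(-8, 0, 60)):
    n = len(y)
    Phi = cosine_basis(x, K)
    rho = np.concatenate(([0.0], (np.pi * np.arange(1, K)) ** (2 * m)))   # penalty weights
    best = (np.inf, None)
    for lam in grid:
        # ridge-type penalized least squares: coef = (Phi'Phi + n*lam*diag(rho))^{-1} Phi'y
        M = Phi.T @ Phi + n * lam * np.diag(rho)
        coef = np.linalg.solve(M, Phi.T @ y)
        fitted = Phi @ coef
        trA = np.trace(Phi @ np.linalg.solve(M, Phi.T))   # effective degrees of freedom
        gcv = n * np.sum((y - fitted) ** 2) / (n - trA) ** 2
        if gcv < best[0]:
            best = (gcv, lam)
    return best[1]

rng = np.random.default_rng(3)
x = rng.uniform(size=500)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.5, size=500)
print("GCV-selected lambda:", gcv_select(x, y))
```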

Tuning Prior: Gaussian Process Construction
To satisfy (1.1), we choose Π_λ and Π as two Gaussian measures induced by the GP priors specified below (this can be verified by applying Hájek's lemma);
Assign a GP prior on f, i.e., Π_λ, as follows:
  f ~ G_λ(·) = Σ_{ν=1}^∞ w_ν φ_ν(·),
where (recall that m is the smoothness of f_0)
  w_ν ~ N(0, 1) for ν = 1, ..., m,
  w_ν ~ N(0, (ρ_ν^{1+β/(2m)} + nλ ρ_ν)^{−1}) for ν > m,
for a sequence ρ_ν ≍ ν^{2m};
Π is induced by a similar GP (by setting λ = 0).
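A hedged sketch of drawing sample paths from G_λ. Only the variance specification of the w_ν follows the slide; the basis φ_ν(x) = √2 cos(πνx) with ρ_ν = (πν)^{2m}, the truncation level K, and the particular (n, λ, m, β) values are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def sample_G_lambda(x, n, lam, m=2, beta=2.0, K=200, rng=None):
    rng = rng or np.random.default_rng()
    nu = np.arange(1, K + 1)
    rho = (np.pi * nu) ** (2 * m)                        # assumed eigenvalues, rho_nu of order nu^{2m}
    var = np.where(nu <= m, 1.0,
                   1.0 / (rho ** (1 + beta / (2 * m)) + n * lam * rho))
    w = rng.normal(scale=np.sqrt(var))                   # w_nu per the slide's variance specification
    phi = np.sqrt(2) * np.cos(np.pi * np.outer(nu, x))   # assumed basis functions phi_nu
    return w @ phi

x = np.linspace(0, 1, 400)
n = 1000
lam = n ** (-2 * 2 / (2 * 2 + 2))                        # lambda of order n^{-2m/(2m+beta)}, m=2, beta=2
for _ in range(3):
    plt.plot(x, sample_G_lambda(x, n, lam))
plt.title("Sample paths from the (assumed-basis) tuning prior G_lambda")
plt.show()
```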

Our construction of the GP prior is motivated by Wahba's Bayesian view on smoothing splines (Wahba, 1990);
The RKHS induced by G_λ is essentially H^{m+β/2}(0, 1), where β adjusts the prior support;
In addition, we need to assume β ∈ (1, 2m + 1) to guarantee E{J(G_λ, G_λ)} < ∞, so that the sample path of G_λ belongs to H^m(0, 1) a.s.

Underlying Eigensystem (φ_ν(·), ρ_ν)
Under mild conditions, f admits a Fourier expansion:
  f(·) = Σ_{ν=1}^∞ f_ν φ_ν(·),
where the φ_ν(·)'s are basis functions in H^m(0, 1).
An example of (φ_ν, ρ_ν) is the solution of the following ODE:
  φ_ν^{(2m)}(·) = ρ_ν φ_ν(·), φ_ν^{(j)}(0) = φ_ν^{(j)}(1) = 0, j = 2, ..., 2m − 1,
where the φ_ν's have closed forms. This is also called the uniform free beam problem in physics.

Nonparametric BvM Theorem
Theorem 1. Given that λ ≍ n^{−2m/(2m+β)}, we have
  sup_{S ⊆ H^m(0,1)} | P(S | D_n) − Π_W(S) | = o_{P^n_{f0}}(1),
where Π_W(·) is the probability measure induced by a GP W.

Specifications of the Limiting GP W
Suppose that f̂_{n,λ}(·) = Σ_ν f̂_{n,ν} φ_ν(·);
The mean function of W (also the approximate posterior mode of P(· | D_n)) is
  f̃_{n,λ}(·) := Σ_ν a_{n,ν} f̂_{n,ν} φ_ν(·).
Hence, f̃_{n,λ} ≠ f̂_{n,λ} (but they are very close);
The mean-zero GP W_n := W − f̃_{n,λ} is expressed as W_n(·) = Σ_ν b_{n,ν} z_ν φ_ν(·) with z_ν ~iid N(0, 1);
Here, a_{n,ν} and b_{n,ν} are both non-random sequences.

Recall Bayesian Aggregation
[Diagram] Big Data (N) → Super machine → R_oracle(α). Divide: Subset 1 (n) → Machine 1 → R_{1,n}(α); Subset 2 (n) → Machine 2 → R_{2,n}(α); ...; Subset s (n) → Machine s → R_{s,n}(α). Aggregate → R_N(α).
Note that N = s × n. Both n and s are allowed to diverge.

Uniform Nonparametric BvM Theorem
The uniform BvM theorem characterizes the limit shapes of a sequence of s nonparametric posterior distributions (under proper tuning priors), as long as s does not grow too fast.
Theorem 2. Given that λ ≍ N^{−2m/(2m+β)} (used in each subset of size n), we have
  sup_{S ⊆ H^m(0,1)} max_{1 ≤ j ≤ s} | P(S | D_{j,n}) − Π_{W_j}(S) | = o_{P^n_{f0}}(1)
as long as s does not grow faster than N^{(β−1)/(2m+β)}.
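For a feel of how restrictive the constraint on s is, here is a small hedged calculation of the admissible order N^{(β−1)/(2m+β)} for a few illustrative values of N (the numbers are mine, not from the talk):

```python
m, beta = 2, 2.0
for N in (10**4, 10**6, 10**8):
    s_max = N ** ((beta - 1) / (2 * m + beta))   # upper growth rate for the number of subsets
    print(f"N = {N:>9d}:  s may grow like N^{(beta - 1) / (2 * m + beta):.3f}, i.e. about {s_max:8.1f}")
```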

Aggregated Credible Interval
The j-th credible ball is defined as
  R_{j,n}(α) = { f ∈ H^m(0, 1) : ‖f − f̃_{j,n}‖_2 ≤ r_{j,n}(α) },
where the radius r_{j,n}(α) is directly obtained via MCMC;
The aggregated credible ball is constructed as
  R_N(α) = { f ∈ H^m(0, 1) : ‖f − f̄_{N,λ}‖_2 ≤ r_N(α) };
As will be seen, the aggregation step amounts to weighted averaging of the individual centers (through their Fourier coefficients) and of the individual radii. No additional computation is needed.

Implication of the Uniform BvM Theorem
The uniform BvM theorem shows that R_N(α) (asymptotically) covers (1 − α) posterior mass and also possesses frequentist validity, as long as λ ≍ N^{−2m/(2m+β)} and s = o(N^{(β−1)/(2m+β)}).

Aggregation Details
Aggregated center:
  f̄_{N,λ}(·) = Σ_{ν=1}^∞ a_{N,ν} f̄_ν φ_ν(·), where f̄_ν = (1/s) Σ_{j=1}^s f̂^{(j)}_{n,ν};
Aggregated radius:
  r_N(α) = [ ζ_{1,N} + (n ζ_{2,N}) / (N ζ_{2,n}) · (1/s) Σ_{j=1}^s ( r_{j,n}²(α) − ζ_{1,n} ) ]^{1/2},
where ζ_{k,n} = Σ_{ν=1}^∞ ( n / (τ_ν² + n(1 + λ ρ_ν)) )^k.
In fact, the aggregated radius r_N is (asymptotically) the same as that of the oracle credible ball; see the simulations.
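A hedged sketch of the aggregation step as reconstructed above: per-subset Fourier coefficients of the centers are averaged, and the squared radii are combined by the weighted formula. The rescaling sequence a_{N,ν} and the ζ constants are taken as given inputs, since their full definitions are not reproduced in this transcription.

```python
import numpy as np

def aggregate(centers, radii, a_N, zeta1_n, zeta2_n, zeta1_N, zeta2_N, n, N):
    """centers: (s, K) array of per-subset Fourier coefficients hat f^{(j)}_{n,nu};
    radii: length-s array of per-subset credible-ball radii r_{j,n}(alpha)."""
    fbar = centers.mean(axis=0)              # bar f_nu = (1/s) sum_j hat f^{(j)}_{n,nu}
    center_N = a_N * fbar                    # aggregated center coefficients a_{N,nu} * bar f_nu
    # aggregated radius, following the reconstructed display above
    r2 = zeta1_N + (n * zeta2_N) / (N * zeta2_n) * np.mean(radii ** 2 - zeta1_n)
    return center_N, np.sqrt(max(r2, 0.0))
```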

Aggregated Credible Interval
We can also aggregate individual credible intervals for linear functionals of f, denoted F(f). Two examples:
Evaluation functional: F_z(f) = f(z);
Integral functional: F_ω(f) = ∫_0^1 f(z) ω(z) dz for a known function ω(·), such as an indicator function;
Individual credible interval for F(f):
  CI^F_{j,n}(α) := { f ∈ S^m(I) : |F(f) − F(f̃_{j,n})| ≤ r_{F,j,n}(α) };
The aggregated version is constructed as
  CI^F_N(α) := { f ∈ S^m(I) : |F(f) − F(f̄_{N,λ})| ≤ r_{F,N}(α) },
where r_{F,N}(α) is a weighted ℓ₂ average of the r_{F,j,n}(α)'s.
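A tiny hedged example of the two linear functionals applied to a basis expansion f = Σ f_ν φ_ν, with the same assumed cosine basis as before and toy coefficients (all choices are mine): evaluation at z = 0.25 and integration against the indicator weight ω = 1_{[0, 0.5]}.

```python
import numpy as np

f_nu = np.array([1.0, 0.5, -0.3])                        # toy Fourier coefficients

def phi(k, x):
    return np.sqrt(2) * np.cos(np.pi * k * x)            # assumed basis functions

def f(x):
    return sum(c * phi(k, x) for k, c in enumerate(f_nu, start=1))

F_eval = f(0.25)                                         # evaluation functional F_z(f) at z = 0.25
grid = np.linspace(0.0, 1.0, 10001)
omega = (grid <= 0.5).astype(float)                      # indicator weight on [0, 0.5]
F_int = np.sum(f(grid) * omega) * (grid[1] - grid[0])    # Riemann-sum approximation of F_omega(f)
print(F_eval, F_int)
```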

A Series of Theoretical Questions...
How to define an aggregation rule s.t. R(α) covers (1 − α) posterior mass, with the same radius as R_oracle(α)?
Weighted averaging of the individual centers (in terms of their Fourier coefficients) and radii by an analytical formula.
How to construct a prior s.t. R(α) covers the true parameter (generating the data) with probability (1 − α)?
Pick a proper tuning prior by GCV.
How fast can we allow s to diverge?
s cannot grow faster than a rate jointly determined by the smoothness of f_0 and the smoothness of the GP prior.

Simulations
Gaussian regression model:
  Y = f_0(X) + ε, where ε ~ N(0, 1) and f_0(x) = 3 β_{30,17}(x) + 2 β_{3,11}(x),
where β_{a,b} is the pdf of the Beta(a, b) distribution. Set m = 2;
Assign a tuning prior with β = 2 and with λ selected by GCV as follows;
Let λ_GCV be the GCV-selected tuning parameter, of order N^{−2m/(2m+1)}, obtained by applying GCV to the entire data (a practical formula needs to be developed here). Set λ = λ_GCV^{(2m+1)/(2m+β)} to match the order N^{−2m/(2m+β)}.
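A hedged sketch of the simulation setup: data are generated from the stated regression function, and a GCV-selected λ of order N^{−2m/(2m+1)} (here its theoretical order is plugged in as a stand-in value) is recalibrated to the order N^{−2m/(2m+β)}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
m, beta_prior, N = 2, 2.0, 10000

def f0(x):
    # true regression function: 3 * Beta(30, 17) pdf + 2 * Beta(3, 11) pdf
    return 3 * stats.beta(30, 17).pdf(x) + 2 * stats.beta(3, 11).pdf(x)

X = rng.uniform(size=N)
Y = f0(X) + rng.normal(size=N)

# Suppose lambda_GCV (of order N^{-2m/(2m+1)}) has been obtained on the entire data;
# its theoretical order is used below as a stand-in value.
lam_gcv = N ** (-2 * m / (2 * m + 1))
lam = lam_gcv ** ((2 * m + 1) / (2 * m + beta_prior))   # now of order N^{-2m/(2m+beta)}
print(lam_gcv, lam)
```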

Figure 1: Plot of the true function f_0.

Computing Time
Figure 2: ρ versus γ based on FCR and ACR (three panels over increasing N, including N = 1200), where ρ = (T_0 − T)/T_0, T_0 is the computing time based on the entire (big) data, T is the divide-and-conquer computing time, and γ = log s / log N describes the growth of s.

Phase Transition: Coverage Probability
Figure 3: Frequentist coverage probability (CP) of R_N(α) against γ (panels for several confidence levels 1 − α, including 0.90, 0.50 and 0.10; ACR and FCR). The red dotted line indicates the position of 1 − α.

Phase Transition: Radius
Figure 4: Radius of R_N(α) against γ for various α.

Thanks for your attention. Questions are welcome.
