Bayesian Sparse Linear Regression with Unknown Symmetric Error
Bayesian Sparse Linear Regression with Unknown Symmetric Error

Minwoo Chae¹, joint work with Lizhen Lin² and David B. Dunson³

¹ Department of Mathematics, The University of Texas at Austin
² Department of Statistics and Data Sciences, The University of Texas at Austin
³ Department of Statistical Science, Duke University

June 17, 2016, BU-KEIO 2016 Workshop
Outline

1. Introduction
2. Sparse linear model
3. Linear model with unknown error distribution
4. Asymptotic results
Symmetric location problem

Y_i = μ + ε_i, where the ε_i are iid ~ η(·) (unknown).

If η is symmetric, efficient and adaptive estimation of μ is possible. [Beran, 1974; Stone, 1975; ...]

Linear regression [Bickel, 1982]: μ = x_i^T θ, θ ∈ R^p, i = 1, ..., n.

For Bayesians, the semi-parametric Bernstein-von Mises (BvM) theorem holds. [Chae, Kim and Kleijn, 2016]

We study a Bayesian approach when p is large.
Bayesian paradigm

A parameter θ is generated according to a prior distribution Π. Conditional on θ, the data X is generated according to a density p_θ. For given observed data X, statistical inference is based on the posterior distribution:

dΠ(θ | X) ∝ p_θ(X) dΠ(θ).

Typically, the posterior distribution can be approximated via MCMC.
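As a sanity check on the proportionality dΠ(θ | X) ∝ p_θ(X) dΠ(θ), the product of prior density and likelihood, normalized on a grid, should match the conjugate posterior in a tractable example. A minimal sketch (not from the talk; the Beta(5, 1)/Bernoulli setup mirrors the illustration used later):

```python
import math
import numpy as np

# Illustrative check: posterior = normalized (prior density x likelihood).
# For a Beta(5, 1) prior and Bernoulli data, the grid-normalized product
# should match the conjugate Beta(5 + k, 1 + n - k) posterior.
rng = np.random.default_rng(0)
n, theta0 = 50, 0.5
k = int(rng.binomial(n, theta0))          # number of successes

grid = np.linspace(1e-4, 1 - 1e-4, 20001)
h = grid[1] - grid[0]
prior = grid ** 4                          # Beta(5, 1) density up to a constant
lik = grid ** k * (1 - grid) ** (n - k)    # Bernoulli likelihood
post = prior * lik
post /= post.sum() * h                     # normalize on the grid

def beta_pdf(x, a, b):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return np.exp(logc + (a - 1) * np.log(x) + (b - 1) * np.log(1 - x))

exact = beta_pdf(grid, 5 + k, 1 + n - k)   # conjugate posterior
print(np.max(np.abs(post - exact)))        # only small quadrature error remains
```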
Bayesian asymptotics

A frequentist would like to know how Bayesian procedures perform from a frequentist viewpoint. Assume that the data X_1, ..., X_n are generated according to a given parameter θ_0, and consider the posterior Π(θ | X_1, ..., X_n). For large enough n, we want Π(θ | X_1, ..., X_n) to put most of its mass near θ_0, for most realizations of X_1, ..., X_n. For infinite-dimensional θ, the choice of the prior is important.
Parametric Bernstein-von Mises theorem

Assume that a parametric model P = {P_θ : θ ∈ Θ} is regular and X_1, ..., X_n iid ~ P_{θ_0}, where θ_0 ∈ Θ.

THEOREM (Bernstein-von Mises) [Le Cam and Yang, 1990]. For any prior with positive density around θ_0,

‖Π(· | X_1, ..., X_n) − N(θ̂_n, I_{θ_0}^{-1}/n)‖_TV → 0 in probability,

where θ̂_n is an efficient estimator of θ and I_{θ_0} is the Fisher information matrix.

Consequently, a Bayesian credible interval is asymptotically a standard confidence interval.
Parametric BvM: illustration

θ ~ Beta(5, 1), X_1, ..., X_n | θ iid ~ Bernoulli(θ), θ_0 = 1/2.

[Figure: posterior densities for n = 10, n = 50, n = ..., concentrating around θ_0 = 1/2.]
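The convergence in the Beta-Bernoulli illustration above can be sketched numerically: the total-variation distance between the exact Beta posterior and its normal approximation N(θ̂, θ̂(1 − θ̂)/n) should shrink as n grows. A hedged sketch (grid-based TV distance; the sample sizes are illustrative):

```python
import math
import numpy as np

def tv_beta_vs_normal(n, k):
    """Grid-based TV distance between the Beta(5+k, 1+n-k) posterior
    (Beta(5, 1) prior) and its BvM normal approximation."""
    grid = np.linspace(1e-4, 1 - 1e-4, 20001)
    h = grid[1] - grid[0]
    a, b = 5 + k, 1 + n - k
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    beta = np.exp(logc + (a - 1) * np.log(grid) + (b - 1) * np.log(1 - grid))
    that = k / n                               # MLE, efficient here
    var = that * (1 - that) / n                # inverse Fisher information / n
    normal = np.exp(-(grid - that) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return 0.5 * np.sum(np.abs(beta - normal)) * h

rng = np.random.default_rng(1)
tv_small = tv_beta_vs_normal(50, int(rng.binomial(50, 0.5)))
tv_large = tv_beta_vs_normal(5000, int(rng.binomial(5000, 0.5)))
print(tv_small, tv_large)  # the TV distance decreases with n
```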
Semi-parametric BvM (fixed p)

Y_i = x_i^T θ + ε_i, where the ε_i are iid ~ η(·) (unknown).

Put a symmetrized Dirichlet process (DP) mixture prior on η.

THEOREM [Chae, Kim and Kleijn, 2016]. For any prior on θ with positive density around θ_0,

‖Π(θ ∈ · | X_1, ..., X_n) − N(θ̂_n, I_{θ_0,η_0}^{-1}/n)‖_TV → 0 in probability,

where θ̂_n is an efficient estimator of θ and I_{θ_0,η_0} is the efficient information matrix.

What if p is large?
Outline

1. Introduction
2. Sparse linear model
3. Linear model with unknown error distribution
4. Asymptotic results
Sparse linear model

Consider the linear regression model

Y_i = x_i^T θ + ε_i, i = 1, ..., n,

where θ = (θ_1, ..., θ_p)^T and possibly p ≫ n. In matrix form, Y = Xθ + ε.

A sparse model assumes that most of the θ_i are (nearly) zero. We apply fully Bayesian procedures and express the sparsity in the prior.
Sparse prior

A prior Π_Θ for θ ∈ R^p can be constructed as follows:

1. (Dimension) Choose s from a prior π_p on {0, 1, ..., p}.
2. (Model) Choose S ⊂ {1, ..., p} of size |S| = s at random.
3. (Nonzero coefficients) Choose θ_S = (θ_i)_{i∈S} from a density g_S on R^S and set θ_{S^c} = 0.

Formally,

(S, θ) ~ π_p(s) · C(p, s)^{-1} · g_S(θ_S) · δ_0(θ_{S^c}),

where C(p, s) denotes the binomial coefficient. The prior π_p on the dimension controls the level of sparsity.
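The three-step construction can be sketched directly; the particular choices below (complexity-type weights π_p(s) ∝ c^{−s} p^{−as} and a Laplace density for g) are illustrative, not the talk's exact specification:

```python
import numpy as np

def sample_sparse_prior(p, rng, c=2.0, a=1.0, lam=1.0):
    """One draw of theta from the three-step sparse prior (illustrative choices)."""
    s_vals = np.arange(p + 1)
    # 1. (Dimension) complexity-type prior pi_p(s) proportional to c^{-s} p^{-a s}
    w = np.exp(-s_vals * (np.log(c) + a * np.log(p)))
    s = rng.choice(s_vals, p=w / w.sum())
    # 2. (Model) support S of size s, uniformly at random
    S = rng.choice(p, size=s, replace=False)
    # 3. (Nonzero coefficients) iid Laplace entries on S, zero elsewhere
    theta = np.zeros(p)
    theta[S] = rng.laplace(scale=1.0 / lam, size=s)
    return theta

rng = np.random.default_rng(0)
draws = np.stack([sample_sparse_prior(200, rng) for _ in range(500)])
print((draws != 0).sum(axis=1).mean())  # average dimension is small
```

With these weights, the prior strongly favors very small supports, which is the sparsity-inducing behavior the slide describes.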
Sparse prior: example

Spike and slab [Ishwaran and Rao, 2005; and many other authors]: for some r ∈ (0, 1),

θ_i | r iid ~ (1 − r) δ_0 + r G, i ≤ p,

for some continuous distribution G; equivalently, the dimension satisfies s ~ Binomial(p, r).

Good asymptotic properties hold if r ~ Beta(1, p^u) for some u > 1 and the tail of G is at least as thick as Laplace. [Castillo and van der Vaart, 2015]
Sparse prior: example

Complexity prior [Castillo and van der Vaart, 2012]:

π_p(s) ∝ c^{-s} p^{-as}, s = 0, 1, ..., p,

for some constants a, c > 0. Roughly, π_p(s) ≍ C(p, s)^{-1} for s ≪ p.
Other priors

Continuous shrinkage priors that peak near zero. Typically, scale mixtures of normals: for i = 1, ..., p,

θ_i | τ², λ_i² ~ N(0, τ² λ_i²), λ_i² ~ π_λ(λ_i²), τ² ~ π_τ(τ²).

1. Bayesian lasso [Park and Casella, 2008]
2. Horseshoe [Carvalho, Polson and Scott, 2010]
3. Normal-gamma [Griffin and Brown, 2010]
4. Generalized double Pareto [Armagan, Dunson and Lee, 2013]
5. Dirichlet-Laplace [Bhattacharya et al., 2016]
6. ...
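As one concrete instance of this family, a draw from the horseshoe prior (half-Cauchy local scales λ_i) can be sketched as follows; the values of p and τ are illustrative:

```python
import numpy as np

# Illustrative horseshoe draw: a scale mixture of normals with
# half-Cauchy local scales, giving a spike at zero plus heavy tails.
rng = np.random.default_rng(0)
p, tau = 1000, 0.1
lam = np.abs(rng.standard_cauchy(p))   # half-Cauchy local scales lambda_i
theta = rng.normal(0.0, tau * lam)     # theta_i | tau, lam_i ~ N(0, tau^2 lam_i^2)

# Most coordinates are shrunk toward zero, while a few escape shrinkage.
print(np.median(np.abs(theta)), np.max(np.abs(theta)))
```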
Outline

1. Introduction
2. Sparse linear model
3. Linear model with unknown error distribution
4. Asymptotic results
Gaussian model

Y_i = x_i^T θ + ε_i, i = 1, ..., n. Assume that ε_i iid ~ η for some density η ∈ H.

Usually it is assumed that η(y) = φ_σ(y) because of

1. computational simplicity, and
2. good theoretical properties.

Some properties (e.g. consistency and rates) tend to be robust to misspecification.
Key problems

Y_i = x_i^T θ + ε_i, i = 1, ..., n. Assume that the ε_i are not really normally distributed. Key problems caused by model misspecification:

1. (Efficiency) The asymptotic variance of √n(θ̂_i − θ_i) can be large.
2. (Uncertainty quantification) Credible sets do not give valid confidence. [Kleijn and van der Vaart, 2012]
3. (Selection) Misspecification might result in serious overfitting. [Grünwald and van Ommen, 2014]

A good remedy: semi-parametric modelling.
Key problems: example [Grünwald and van Ommen, 2014]

Y_i = θ_int + θ_1 x_i + θ_2 x_i² + ... + θ_p x_i^p + ε_i, θ_0 = 0 ∈ R^{p+1}.
Frequentist's method for fixed p

Y_i = x_i^T θ + ε_i, ε_i ~ η. There is an efficient estimator of θ. [Bickel, 1982] One way to get an efficient estimator is:

1. Find an initial √n-consistent estimator θ̃_n.
2. Estimate the score function from the residuals ε̃_i = Y_i − θ̃_n^T x_i.
3. Solve the score equation using a one-step Newton-Raphson iteration.

Does it work if p ≫ n?
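The three steps can be sketched in a deliberately simplified setting: for illustration the error density is taken to be a known Laplace(0, 1), so the score of the location parameter is just the sign of the residual; in the actual procedure the score would be estimated from the residuals.

```python
import numpy as np

# Simplified one-step recipe (illustrative; score assumed known, p = 1).
rng = np.random.default_rng(0)
n, theta0 = 5000, 2.0
x = rng.normal(size=n)
y = theta0 * x + rng.laplace(size=n)      # Laplace(0, 1) errors

# 1. Initial sqrt(n)-consistent estimator: least squares.
theta_init = (x @ y) / (x @ x)
# 2. Residuals from the initial fit.
resid = y - theta_init * x
# 3. One Newton-Raphson step on the score equation.
#    For Laplace(0, 1), score = x * sign(eps) and information I = E[x^2].
score = (x * np.sign(resid)).mean()
info = (x ** 2).mean()
theta_onestep = theta_init + score / info
print(theta_init, theta_onestep)
```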
Bayesian method for fixed p

Y_i = x_i^T θ + ε_i, ε_i ~ η. Put a symmetrized DP mixture prior Π_H on η:

η(y) = ∫ φ_σ(y − z) dF̄(z, σ), F ~ DP(α),

where dF̄(z, σ) = (dF(z, σ) + dF(−z, σ))/2.

Then the BvM theorem holds. [Chae, Kim and Kleijn, 2016]

Inference: Gibbs sampler.
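For a truncated mixing measure with atoms (z_j, σ_j) and weights w_j, the symmetrization makes η symmetric by construction, which can be checked numerically. A sketch (the Dirichlet weights stand in for a truncated DP draw; all values are illustrative):

```python
import math
import numpy as np

def eta(y, w, z, sig):
    """Symmetrized normal mixture: equal weight on components at +z_j and -z_j."""
    y = np.asarray(y)[:, None]
    phi_plus = np.exp(-(y - z) ** 2 / (2 * sig ** 2)) / (sig * math.sqrt(2 * math.pi))
    phi_minus = np.exp(-(y + z) ** 2 / (2 * sig ** 2)) / (sig * math.sqrt(2 * math.pi))
    return (0.5 * (phi_plus + phi_minus) * w).sum(axis=1)

rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(5))        # illustrative stand-in for DP weights
z = rng.uniform(0, 2, size=5)        # atoms within [-M, M], here M = 2
sig = rng.uniform(0.5, 1.5, size=5)  # scales within [sigma_1, sigma_2]

ygrid = np.linspace(-4, 4, 101)
vals = eta(ygrid, w, z, sig)
print(np.max(np.abs(vals - vals[::-1])))  # symmetry: eta(y) == eta(-y)
```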
Bayesian inference

Y_i = x_i^T θ + ε_i, ε_i ~ η, is equivalent to the augmented model

Y_i = x_i^T θ + z_i + σ_i ε̃_i, (z_i, σ_i) ~ F̄, ε̃_i ~ N(0, 1).

Inference can be done through a Gibbs sampler:

1. Given (z_i, σ_i)_{i≤n}, θ can be sampled as in the Gaussian model.
2. Given θ, (z_i, σ_i)_{i≤n} can be sampled as in the DPM model.

The additional computational burden from semi-parametric modelling depends only on n. Feasible even when p ≫ n!
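A toy version of this two-block sampler can be sketched by replacing the nonparametric mixture with a fixed finite symmetrized mixture, so that both conditional updates are explicit; everything below (atoms, weights, sample size, flat prior on θ) is illustrative and not the talk's actual algorithm:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 500, 1.0
x = rng.normal(size=n)
y = theta0 * x + rng.normal(size=n)

z = np.array([-1.0, 0.0, 1.0])       # fixed symmetric atoms (illustrative)
sig = np.array([1.0, 1.0, 1.0])      # fixed scales
w = np.array([0.25, 0.5, 0.25])      # fixed symmetric weights

theta, trace = 0.0, []
for it in range(500):
    # Block 2: given theta, sample a component label c_i for each observation.
    r = y - theta * x
    logp = np.log(w) - 0.5 * ((r[:, None] - z) / sig) ** 2 - np.log(sig)
    prob = np.exp(logp - logp.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    c = (rng.random(n)[:, None] > np.cumsum(prob, axis=1)).sum(axis=1)
    # Block 1: given (z_i, sigma_i) = (z[c_i], sig[c_i]), theta is
    # conditionally Gaussian under a flat prior, as in the Gaussian model.
    prec = np.sum(x ** 2 / sig[c] ** 2)
    mean = np.sum(x * (y - z[c]) / sig[c] ** 2) / prec
    theta = rng.normal(mean, 1.0 / math.sqrt(prec))
    trace.append(theta)

print(np.mean(trace[100:]))  # posterior mean should sit near theta0
```

Only the second block would change under the full DP mixture (the atoms and weights would themselves be updated), which is why the extra cost scales with n rather than p.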
Outline

1. Introduction
2. Sparse linear model
3. Linear model with unknown error distribution
4. Asymptotic results
Goal: frequentist properties (p ≫ n)

Assume a fixed design X, and that the response vector Y is really generated from a given (θ_0, η_0), possibly with p ≫ n. We want the (marginal) posterior Π(θ | Y):

1. (Recovery) to put most of its mass around θ_0;
2. (Uncertainty quantification) to express the remaining uncertainty;
3. (Selection) to find the true nonzero set S_0 of θ_0;
4. (Adaptation) to adapt to the unknown sparsity level and error density;

with high P_{θ_0,η_0}-probability.
Prior for θ

(i) The probability π_p(s) decreases exponentially [Castillo and van der Vaart, 2012; 2015]: for some constants A_1, A_2, A_3, A_4 > 0,

A_1 p^{-A_3} π_p(s − 1) ≤ π_p(s) ≤ A_2 p^{-A_4} π_p(s − 1), s = 1, ..., p.

(ii) The tails of the nonzero coefficients are at least as thick as the Laplace distribution [Castillo and van der Vaart, 2012; van der Pas et al., 2016]:

g_S(θ) = ∏_{i∈S} g(θ_i), g(θ_i) ∝ e^{−λ|θ_i|},

where λ satisfies √n/p ≤ λ ≤ √(n log p).
Prior for η

Put a symmetrized DP mixture prior Π_H on η [Chae, Kim and Kleijn, 2016]:

η(y) = ∫ φ_σ(y − z) dF̄(z, σ), F ~ DP(α),

where dF̄(z, σ) = (dF(z, σ) + dF(−z, σ))/2. Assume that supp(α) ⊂ [−M, M] × [σ_1, σ_2] for some positive constants M and σ_1 < σ_2.
Design matrix

Assume uniformly bounded covariates: |x_ij| ≤ 1. Define the uniform compatibility number and the restricted eigenvalue

φ²(s) = inf { s_θ ‖Xθ‖_2² / (n ‖θ‖_1²) : 0 < s_θ ≤ s },
ψ²(s) = inf { ‖Xθ‖_2² / (n ‖θ‖_2²) : 0 < s_θ ≤ s },

where s_θ denotes the number of nonzero coordinates of θ. The condition φ(Ks_0) ≳ 1 (resp. ψ(Ks_0) ≳ 1) for some constant K > 1 is sufficient for the recovery of θ in ℓ_1 (resp. ℓ_2) norm.
Design matrix: examples

By the Cauchy-Schwarz inequality, φ(s) ≥ ψ(s). ψ(s) ≳ 1 in many examples:

1. Typically, ψ(s) is bounded away from zero when s · max_{i≠j} |corr(x_i, x_j)| is small. [Lounici, 2008]
2. If the x_ij are i.i.d. random variables, then ψ(s) ≳ 1 with high probability for s ≲ n / log p. [Cai and Jiang, 2011]
3. If p = n and corr(x_i, x_j) = ρ^{|i−j|} for some ρ ∈ (0, 1), then ψ(p) ≳ 1. [Zhao and Yu, 2006]

There are examples for which φ(s) ≳ 1 holds but the corresponding bound for ψ(s) does not. [van de Geer and Bühlmann, 2009]
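For small p, ψ(s) can be computed by brute force from its definition, since the infimum over θ with support size at most s reduces to a minimum over supports S of the smallest eigenvalue of X_S^T X_S / n. A sketch (the orthogonal design is an illustrative check, for which ψ²(s) = 1 exactly):

```python
import itertools
import numpy as np

def psi_squared(X, s):
    """Brute-force restricted eigenvalue: min over supports |S| <= s of
    the smallest eigenvalue of X_S^T X_S / n."""
    n, p = X.shape
    out = np.inf
    for size in range(1, s + 1):
        for S in itertools.combinations(range(p), size):
            idx = list(S)
            G = X[:, idx].T @ X[:, idx] / n
            out = min(out, np.linalg.eigvalsh(G)[0])  # eigvalsh: ascending order
    return out

n = p = 6
X = np.sqrt(n) * np.eye(p)   # orthogonal design: psi^2(s) = 1 for every s
print(psi_squared(X, 3))
```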
Asymptotics: dimension

THEOREM [Chae, Lin and Dunson, 2016]. If λ‖θ_0‖_1 ≲ s_0 log p and s_0 log p ≲ n, then

E Π(s_θ > K s_0 | Y) → 0

for some constant K > 1.

A small value of λ is preferred when ‖θ_0‖_1 is large.
Asymptotics: consistency

d_n²((θ, η), (θ_0, η_0)) = (1/n) Σ_{i=1}^n d_H²(p_{θ,η,i}, p_{θ_0,η_0,i}).

The mean Hellinger distance d_n allows one to construct certain exponentially consistent tests for independent observations. [Birgé, 1983; Ghosal and van der Vaart, 2007]

THEOREM [Chae, Lin and Dunson, 2016]. If, furthermore, φ(Ks_0) ≳ 1, then

E Π( d_n((θ, η), (θ_0, η_0)) ≳ √(s_0 log p / n) | Y ) → 0.
Asymptotics: consistency (cont.)

THEOREM [Chae, Lin and Dunson, 2016]. Under the previous conditions,

E Π( d_H(η, η_0) ≳ √(s_0 log p / n) | Y ) → 0.

If, furthermore, s_0² log p / φ²(Ks_0) ≲ n, then

E Π( ‖θ − θ_0‖_1 ≳ (s_0 / φ(Ks_0)) √(log p / n) | Y ) → 0,
E Π( ‖θ − θ_0‖_2 ≳ (1 / ψ(Ks_0)) √(s_0 log p / n) | Y ) → 0,
E Π( ‖X(θ − θ_0)‖_2 ≳ √(s_0 log p) | Y ) → 0.
Asymptotics: LAN

r_n(θ, η) = L_n(θ, η) − L_n(θ_0, η) − { √n (θ − θ_0)^T G_n ℓ̇_{θ_0,η_0} − (n/2) (θ − θ_0)^T V_{n,η_0} (θ − θ_0) }.

THEOREM [Chae, Lin and Dunson, 2016]. If s_0 log p ≲ n^{1/6}, then

sup_{η ∈ H_n} sup_{θ ∈ Θ_n} |r_n(θ, η)| = o_P(1),

where Π(Θ_n × H_n | Y) → 1 in probability.
Asymptotics: BvM theorem

Let N_{n,S} be the |S|-dimensional normal distribution to which an efficient estimator √n(θ̂_S − θ_{0,S}) converges in distribution.

THEOREM [Chae, Lin and Dunson, 2016]. If, furthermore, λ s_0 log p ≲ √n and ψ(Ks_0) ≳ 1, then

sup_{S ∈ S_n} sup_B | Π( √n(θ_S − θ_{0,S}) ∈ B | Y, S_θ = S ) − N_{n,S}(B) | = o_P(1),

where Π(S_θ ∈ S_n | Y) → 1 in probability.

That is, the posterior distribution of the nonzero coefficients is asymptotically a mixture of normal distributions.
Asymptotics: selection

THEOREM [Chae, Lin and Dunson, 2016]. Under the previous conditions,

Π(S_θ ≠ S_0 | Y) → 0 in probability.

The true nonzero coefficients can be selected provided that no nonzero coefficient is too small (a beta-min condition).
Discussion

The condition s_0 log p ≲ n^{1/6} is required due to semi-parametric bias. If η is known (not necessarily Gaussian) and p = s_0, the condition may be reduced to s_0 ≲ n^{1/3}, and this cannot be improved. [Panov and Spokoiny, 2015] In some parametric models, s_0 ≲ n^{1/6} is required for the BvM theorem. [Ghosal, 2000]

The results can be extended to more general priors, e.g. with M → ∞, σ_2 → ∞ and σ_1 → 0, but a sub-Gaussian tail of ℓ_{η_0} is (perhaps) essential for selection. [Kim and Jeon, 2016]
Selected references

[1] Castillo, I., Schmidt-Hieber, J., and van der Vaart, A. W. (2015). Bayesian linear regression with sparse priors. Ann. Statist.
[2] Chae, M. (2015). The semi-parametric Bernstein-von Mises theorem for models with symmetric error. PhD thesis, Seoul National University. arXiv.
[3] Chae, M., Kim, Y., and Kleijn, B. J. K. (2016). The semi-parametric Bernstein-von Mises theorem for regression models with symmetric errors. arXiv.
[4] Chae, M., Lin, L., and Dunson, D. B. (2016). Bayesian sparse linear regression with unknown symmetric error. arXiv.
[5] Grünwald, P. and van Ommen, T. (2014). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. arXiv.
[6] Hanson, D. L. and Wright, F. T. (1971). A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist.
[7] Pollard, D. (2001). Bracketing methods. Unpublished manuscript. Available at pollard/books/asymptopia/bracketing.pdf.
[8] van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Statist.
More informationDirichlet Processes: Tutorial and Practical Course
Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS
More informationDISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich
Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan
More informationHierarchical models. Dr. Jarad Niemi. August 31, Iowa State University. Jarad Niemi (Iowa State) Hierarchical models August 31, / 31
Hierarchical models Dr. Jarad Niemi Iowa State University August 31, 2017 Jarad Niemi (Iowa State) Hierarchical models August 31, 2017 1 / 31 Normal hierarchical model Let Y ig N(θ g, σ 2 ) for i = 1,...,
More informationTime Series and Dynamic Models
Time Series and Dynamic Models Section 1 Intro to Bayesian Inference Carlos M. Carvalho The University of Texas at Austin 1 Outline 1 1. Foundations of Bayesian Statistics 2. Bayesian Estimation 3. The
More informationThanks. Presentation help: B. Narasimhan, N. El Karoui, J-M Corcuera Grant Support: NIH, NSF, ARC (via P. Hall) p.2
p.1 Collaborators: Felix Abramowich Yoav Benjamini David Donoho Noureddine El Karoui Peter Forrester Gérard Kerkyacharian Debashis Paul Dominique Picard Bernard Silverman Thanks Presentation help: B. Narasimhan,
More informationBERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po
The Annals of Statistics 2001, Vol. 29, No. 5, 1264 1280 CONVERGENCE RATES FOR DENSITY ESTIMATION WITH BERNSTEIN POLYNOMIALS By Subhashis Ghosal University of Minnesota Mixture models for density estimation
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationFractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling
Fractional Hot Deck Imputation for Robust Inference Under Item Nonresponse in Survey Sampling Jae-Kwang Kim 1 Iowa State University June 26, 2013 1 Joint work with Shu Yang Introduction 1 Introduction
More informationA Tight Excess Risk Bound via a Unified PAC-Bayesian- Rademacher-Shtarkov-MDL Complexity
A Tight Excess Risk Bound via a Unified PAC-Bayesian- Rademacher-Shtarkov-MDL Complexity Peter Grünwald Centrum Wiskunde & Informatica Amsterdam Mathematical Institute Leiden University Joint work with
More informationConsistent high-dimensional Bayesian variable selection via penalized credible regions
Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable
More informationNonparametric Bayes tensor factorizations for big data
Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional
More informationAsymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors
Ann Inst Stat Math (29) 61:835 859 DOI 1.17/s1463-8-168-2 Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Taeryon Choi Received: 1 January 26 / Revised:
More informationarxiv: v2 [math.st] 12 Feb 2008
arxiv:080.460v2 [math.st] 2 Feb 2008 Electronic Journal of Statistics Vol. 2 2008 90 02 ISSN: 935-7524 DOI: 0.24/08-EJS77 Sup-norm convergence rate and sign concentration property of Lasso and Dantzig
More informationIntroduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems
Introduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems Anne Sabourin, Ass. Prof., Telecom ParisTech September 2017 1/78 1. Lecture 1 Cont d : Conjugate priors and exponential
More informationBayesian Variable Selection for Skewed Heteroscedastic Response
Bayesian Variable Selection for Skewed Heteroscedastic Response arxiv:1602.09100v2 [stat.me] 3 Jul 2017 Libo Wang 1, Yuanyuan Tang 1, Debajyoti Sinha 1, Debdeep Pati 1, and Stuart Lipsitz 2 1 Department
More informationA Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness
A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model
More informationCan we do statistical inference in a non-asymptotic way? 1
Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.
More informationFoundations of Nonparametric Bayesian Methods
1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models
More informationIntroduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models
Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December
More informationOutline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution
Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model
More informationSparsity and the truncated l 2 -norm
Lee H. Dicker Department of Statistics and Biostatistics utgers University ldicker@stat.rutgers.edu Abstract Sparsity is a fundamental topic in highdimensional data analysis. Perhaps the most common measures
More informationEstimating Sparse High Dimensional Linear Models using Global-Local Shrinkage
Estimating Sparse High Dimensional Linear Models using Global-Local Shrinkage Daniel F. Schmidt Centre for Biostatistics and Epidemiology The University of Melbourne Monash University May 11, 2017 Outline
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationA New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables
A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables Qi Tang (Joint work with Kam-Wah Tsui and Sijian Wang) Department of Statistics University of Wisconsin-Madison Feb. 8,
More informationMIT Spring 2016
MIT 18.655 Dr. Kempthorne Spring 2016 1 MIT 18.655 Outline 1 2 MIT 18.655 Decision Problem: Basic Components P = {P θ : θ Θ} : parametric model. Θ = {θ}: Parameter space. A{a} : Action space. L(θ, a) :
More information