Bayesian Statistics in High Dimensions


1 Bayesian Statistics in High Dimensions Lecture 2: Sparsity Aad van der Vaart Universiteit Leiden, Netherlands 47th John H. Barrett Memorial Lectures, Knoxville, Tennessee, May 2017

2 Contents Sparsity Bayesian Sparsity Frequentist Bayes Model Selection Prior Horseshoe Prior

3 Sparsity

4 Sparsity sequence model A sparse model has many parameters, but most of them are (nearly) zero.

5 Sparsity sequence model A sparse model has many parameters, but most of them are (nearly) zero. In this lecture, (Bayesian) theory for: $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. $n$ independent observations $Y^n_1,\ldots,Y^n_n$. $n$ unknowns $\theta_1,\ldots,\theta_n$. $Y^n_i = \theta_i + \varepsilon_i$, for standard normal noise $\varepsilon_i$. $n$ is large. Many of $\theta_1,\ldots,\theta_n$ are (almost) zero.
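A minimal simulation of this sequence model may help fix ideas; the sketch below (plain NumPy, with illustrative values of $n$ and $s_n$) draws a sparse mean vector and one observation $Y^n = \theta + \varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, s_n = 500, 25               # illustrative sizes: n parameters, s_n nonzero
theta = np.zeros(n)
theta[:s_n] = 5.0              # a few big signals; the rest exactly zero

eps = rng.standard_normal(n)   # standard normal noise
Y = theta + eps                # Y_i = theta_i + eps_i, i = 1, ..., n
```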

6 History Given $Y^n \mid \theta \sim N_n(\theta, I)$ the obvious estimator of $\theta$ is $Y^n$: ML, UMVU, best-equivariant, minimax

7 History Given $Y^n \mid \theta \sim N_n(\theta, I)$ the obvious estimator of $\theta$ is $Y^n$: ML, UMVU, best-equivariant, minimax, but inadmissible. Theorem [Stein, 1956] If $n \ge 3$, then there exists $T$ such that for every $\theta \in \mathbb{R}^n$, $E_\theta\|T(Y^n) - \theta\|^2 < E_\theta\|Y^n - \theta\|^2$.

8 History Given $Y^n \mid \theta \sim N_n(\theta, I)$ the obvious estimator of $\theta$ is $Y^n$: ML, UMVU, best-equivariant, minimax, but inadmissible. Theorem [Stein, 1956] If $n \ge 3$, then there exists $T$ such that for every $\theta \in \mathbb{R}^n$, $E_\theta\|T(Y^n) - \theta\|^2 < E_\theta\|Y^n - \theta\|^2$. Empirical Bayes method [Robbins, 1960s]: Working hypothesis: $\theta_1,\ldots,\theta_n$ iid $\sim G$. Estimate $G$ using $Y^n$, pretending this is true. $T(Y^n) := E_{\hat G(Y^n)}(\theta \mid Y^n)$.

9 History Given $Y^n \mid \theta \sim N_n(\theta, I)$ the obvious estimator of $\theta$ is $Y^n$: ML, UMVU, best-equivariant, minimax, but inadmissible. Theorem [Stein, 1956] If $n \ge 3$, then there exists $T$ such that for every $\theta \in \mathbb{R}^n$, $E_\theta\|T(Y^n) - \theta\|^2 < E_\theta\|Y^n - \theta\|^2$. Empirical Bayes method [Robbins, 1960s]: Working hypothesis: $\theta_1,\ldots,\theta_n$ iid $\sim G$. Estimate $G$ using $Y^n$, pretending this is true. $T(Y^n) := E_{\hat G(Y^n)}(\theta \mid Y^n)$. For large $n$ the gain can be substantial ("borrowing of strength"), and can be targeted to special subsets of $\mathbb{R}^n$, e.g. sparse vectors.
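One concrete instance of Robbins's scheme, with the working prior $G = N(0, \sigma^2)$, admits a closed form: marginally $Y_i \sim N(0, \sigma^2 + 1)$, so $\sigma^2$ can be estimated by moments, and the resulting posterior mean shrinks $Y$ linearly, a James-Stein-type rule. A sketch under this Gaussian working hypothesis (not Robbins's general construction):

```python
import numpy as np

def eb_posterior_mean(y):
    """Empirical Bayes posterior mean under the working model
    theta_i iid N(0, sigma2) and Y_i | theta_i ~ N(theta_i, 1).
    Marginally Y_i ~ N(0, sigma2 + 1), so sigma2 is estimated by
    moments; E(theta_i | Y_i) then shrinks Y_i linearly toward 0."""
    sigma2_hat = max(np.mean(y**2) - 1.0, 0.0)  # moment estimate of var(theta)
    shrink = sigma2_hat / (sigma2_hat + 1.0)
    return shrink * y
```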

10 Sparsity regression $Y^n \mid \theta \sim N_n(X_{n\times p}\,\theta, \sigma^2 I)$, for $\theta = (\theta_1,\ldots,\theta_p) \in \mathbb{R}^p$. $Y^n_i$: measurement on individual $i = 1,\ldots,n$. $X_{ij}$: score of individual $i$ on feature $j = 1,\ldots,p$. $\theta_j$: effect of feature $j$. Sparse if only few features matter. If $p > n$, then sparsity is necessary to recover $\theta$, and $X$ must be sparse-invertible, e.g.: Compatibility: $\inf_{\theta:\,|S_\theta| \le 5 s_n} \frac{\|X\theta\|_2 \sqrt{|S_\theta|}}{\|X\|\,\|\theta\|_1} \gg 0$. Mutual coherence: $s_n \max_{i \ne j} |\mathrm{cor}(X_{\cdot i}, X_{\cdot j})| \ll 1$. $s_n$ = true number of nonzero coordinates.

11 Sparsity RNA sequencing $Y_{i,j}$: RNA expression count of tag $j = 1,\ldots,p$ in tissue $i = 1,\ldots,n$. $x_i$: covariate(s) of tissue $i$, e.g. 0 or 1 for normal or cancer. Sparse if only few tags (genes) matter. $Y_{i,j} \sim$ (zero-inflated) negative binomial, with $EY_{i,j} = e^{\alpha_j + \beta_j x_i}$, $\mathrm{var}\,Y_{i,j} = EY_{i,j}\bigl(1 + EY_{i,j}\,e^{\phi_j}\bigr)$. The distribution of the $\beta_j$'s and $\phi_j$'s is estimated by empirical Bayes. [Smyth & Robinson et al., van de Wiel, Leday, van Wieringen, & vdv et al., 12]

12 Sparsity Gaussian graphical model Data: $n$ iid copies of $Y^p \mid \theta \sim N_p(0, \theta^{-1})$. $Y^p_j$ = value of individual on feature $j$. The precision matrix $\theta$ gives the partial correlations: $\mathrm{cor}\bigl(Y^p_i, Y^p_j \mid Y^p_k : k \ne i,j\bigr) = -\theta_{i,j}/\sqrt{\theta_{i,i}\,\theta_{j,j}}$. Nodes $1,2,\ldots,p$; edge $(i,j)$ present iff $\theta_{i,j} \ne 0$; sparse if few edges. [Figure: apoptosis network.]
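Given a precision matrix, the partial correlations and the edge set follow directly from the displayed identity; a small sketch:

```python
import numpy as np

def partial_correlations(theta):
    """cor(Y_i, Y_j | rest) = -theta_ij / sqrt(theta_ii * theta_jj)."""
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def edge_set(theta, tol=1e-10):
    """Edges of the graph: (i, j), i < j, present iff theta_ij != 0."""
    p = theta.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(theta[i, j]) > tol]
```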

13 Bayesian Sparsity

14 Bayesian sparsity A sparse model has many parameters, but most of them are (nearly) zero.

15 Bayesian sparsity A sparse model has many parameters, but most of them are (nearly) zero. We express this in the prior, and apply the standard (full or empirical) Bayesian machine.

16 Bayesian sparsity A sparse model has many parameters, but most of them are (nearly) zero. We express this in the prior, and apply the standard (full or empirical) Bayesian machine. Prior $\theta \sim \Pi$ and data $Y^n \mid \theta \sim p(\cdot \mid \theta)$ give the posterior: $d\Pi(\theta \mid Y^n) \propto p(Y^n \mid \theta)\, d\Pi(\theta)$.

17 Bayesian sparsity Gaussian graphical model Nodes $1,2,\ldots,p$; edge $(i,j)$ present iff $\theta_{i,j} \ne 0$; sparse if few edges. [Figure: apoptosis network.] For a given incidence matrix $(P_{i,j})$ use different priors for $(\theta_{i,j}: P_{i,j} = 0)$ and $(\theta_{i,j}: P_{i,j} = 1)$. [Leday, Kpogbezan, vdv, van Wieringen, van de Wiel (2016, 17).]

18 Model selection prior Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^p$: (1) Choose $s$ from a prior on $\{0,1,2,\ldots,p\}$. (2) Choose $S \subset \{1,\ldots,p\}$ of size $|S| = s$ at random. (3) Choose $\theta_S = (\theta_i: i \in S) \sim g_S$ and set $\theta_{S^c} = 0$. [Mitchell & Beauchamp (88), George, George & McCulloch, Yuan, Berger, Johnstone & Silverman, Richardson et al., Johnson & Rossell, Chao Gao,...]

19 Model selection prior Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^p$: (1) Choose $s$ from a prior on $\{0,1,2,\ldots,p\}$. (2) Choose $S \subset \{1,\ldots,p\}$ of size $|S| = s$ at random. (3) Choose $\theta_S = (\theta_i: i \in S) \sim g_S$ and set $\theta_{S^c} = 0$. Example: spike and slab. Choose $\theta_1,\ldots,\theta_p$ i.i.d. from $\tau\delta_0 + (1-\tau)G$. Put a prior on $\tau$, e.g. Beta$(1, p+1)$. Then $s$ is binomial and $g_S = \otimes_{i \in S}\, g$. [Mitchell & Beauchamp (88), George, George & McCulloch, Yuan, Berger, Johnstone & Silverman, Richardson et al., Johnson & Rossell, Chao Gao,...]
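A draw from this prior is easy to simulate; the sketch below implements the three construction steps and, separately, the spike-and-slab version (the prior $\pi$ on the dimension and the Laplace slab are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_model_selection_prior(p, pi):
    """One draw from the constructive prior: (1) s ~ pi, a pmf on {0,...,p};
    (2) S uniform among subsets of size s; (3) theta_S iid from a Laplace
    slab (an illustrative choice of g_S), theta zero off S."""
    s = rng.choice(p + 1, p=pi)
    S = rng.choice(p, size=s, replace=False)
    theta = np.zeros(p)
    theta[S] = rng.laplace(size=s)
    return theta

def sample_spike_and_slab(p):
    """tau ~ Beta(1, p+1); theta_i iid from tau*delta_0 + (1-tau)*Laplace."""
    tau = rng.beta(1.0, p + 1.0)
    in_spike = rng.random(p) < tau
    return np.where(in_spike, 0.0, rng.laplace(size=p))
```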

20 Horseshoe prior Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^p$: (1) Generate $\tau \sim \mathrm{Cauchy}^+(0,\sigma)$ (?) (2) Generate $\psi_1,\ldots,\psi_p$ iid from $\mathrm{Cauchy}^+(0,\tau)$. (3) Generate independent $\theta_i \sim N(0, \psi_i)$.

21 Horseshoe prior Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^p$: (1) Generate $\tau \sim \mathrm{Cauchy}^+(0,\sigma)$ (?) (2) Generate $\psi_1,\ldots,\psi_p$ iid from $\mathrm{Cauchy}^+(0,\tau)$. (3) Generate independent $\theta_i \sim N(0, \psi_i)$. Motivation: if $\theta \sim N(0,\psi)$ and $Y \mid \theta \sim N(\theta,1)$, then $\theta \mid Y, \psi \sim N\bigl((1-\kappa)Y,\, 1-\kappa\bigr)$ for $\kappa = 1/(1+\psi)$. This suggests a prior for $\kappa$ that concentrates near 0 or 1. [Carvalho & Polson & Scott, 10.]

22 Horseshoe prior Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^p$: (1) Generate $\tau \sim \mathrm{Cauchy}^+(0,\sigma)$ (?) (2) Generate $\psi_1,\ldots,\psi_p$ iid from $\mathrm{Cauchy}^+(0,\tau)$. (3) Generate independent $\theta_i \sim N(0, \psi_i)$. Motivation: if $\theta \sim N(0,\psi)$ and $Y \mid \theta \sim N(\theta,1)$, then $\theta \mid Y, \psi \sim N\bigl((1-\kappa)Y,\, 1-\kappa\bigr)$ for $\kappa = 1/(1+\psi)$. This suggests a prior for $\kappa$ that concentrates near 0 or 1. [Figures: prior of the shrinkage factor $\kappa$; prior density of $\theta_i$; posterior mean of $\theta_i$ as a function of $Y_i$.] [Carvalho & Polson & Scott, 10.]
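Sampling from the constructive definition is again direct, and the conditional posterior mean in the motivation can be coded from the displayed formula. A sketch following the slide's notation (with $\psi_i$ used as the variance of $\theta_i$):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_horseshoe_prior(p, sigma=1.0):
    """(1) tau ~ Cauchy+(0, sigma); (2) psi_i iid Cauchy+(0, tau);
    (3) theta_i ~ N(0, psi_i). Half-Cauchy draws via |Cauchy| * scale."""
    tau = abs(sigma * rng.standard_cauchy())
    psi = np.abs(tau * rng.standard_cauchy(p))
    return rng.normal(0.0, np.sqrt(psi))

def posterior_mean_given_psi(y, psi):
    """theta | Y, psi ~ N((1-kappa) Y, 1-kappa) with kappa = 1/(1+psi):
    kappa near 1 shrinks to zero, kappa near 0 leaves Y untouched."""
    kappa = 1.0 / (1.0 + psi)
    return (1.0 - kappa) * y
```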

23 Other sparsity priors Bayesian LASSO: $\theta_1,\ldots,\theta_p$ iid from a mixture of Laplace$(\lambda)$ distributions over $\lambda \sim \Gamma(a,b)$. Bayesian bridge: same, but with the Laplace density replaced by a density $\propto e^{-\lambda|y|^\alpha}$. Normal-Gamma: $\theta_1,\ldots,\theta_p$ iid from a Gamma scale mixture of Gaussians. Correlated multivariate normal-gamma: $\theta = C\phi$ for a $p \times k$ matrix $C$ and $\phi$ with independent normal-gamma$(a_i, 1/2)$ coordinates. Horseshoe. Horseshoe+. Normal spike. Scalar multiple of Dirichlet. Nonparametric Dirichlet.... [Park & Casella 08, Polson & Scott, Griffin & Brown 10, 12, Carvalho & Polson & Scott 10, George & Rockova 13, Bhattacharya et al. 12,...]

24 LASSO is not Bayesian $\hat\theta_{\mathrm{LASSO}} = \mathrm{argmin}_\theta \bigl[\|Y^n - X\theta\|^2 + \lambda_n \sum_{i=1}^p |\theta_i|\bigr]$: the posterior mode for the prior $\theta_i$ iid Laplace$(\lambda_n)$; works great, but the full posterior distribution is useless.

25 LASSO is not Bayesian $\hat\theta_{\mathrm{LASSO}} = \mathrm{argmin}_\theta \bigl[\|Y^n - X\theta\|^2 + \lambda_n \sum_{i=1}^p |\theta_i|\bigr]$: the posterior mode for the prior $\theta_i$ iid Laplace$(\lambda_n)$; works great, but the full posterior distribution is useless. Theorem If $\sqrt{n}/\lambda_n \to \infty$, then $E_0 \Pi_n\bigl(\theta: \|\theta\|_2 \le \sqrt{n}/\lambda_n \mid Y^n\bigr) \to 0$. $\lambda_n = \sqrt{2\log n}$ gives almost no Bayesian shrinkage. Trouble: $\lambda_n$ must be large to shrink $\theta_i$ to 0, but small to model nonzero $\theta_i$.
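The identification of the LASSO with a posterior mode is a one-line computation: with $Y^n \mid \theta \sim N_n(X\theta, I)$ and $\theta_i$ iid Laplace$(\lambda_n)$, minus the log posterior density is, up to an additive constant,

```latex
-\log\bigl(p(Y^n \mid \theta)\,\pi(\theta)\bigr)
  = \tfrac{1}{2}\,\|Y^n - X\theta\|^2 + \lambda_n \sum_{i=1}^p |\theta_i| + \text{const},
```

so its minimizer is exactly $\hat\theta_{\mathrm{LASSO}}$ (with the factor $\tfrac12$ absorbed into $\lambda_n$).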

26 Frequentist Bayes

27 Frequentist Bayes Assume the data $Y^n$ follow a given parameter $\theta_0$. Consider the posterior $\Pi(\theta \in \cdot \mid Y^n)$ as a random measure on the parameter set. We like $\Pi(\cdot \mid Y^n)$: to put most of its mass near $\theta_0$ for most $Y^n$; to have a spread that expresses the remaining uncertainty; to select the model defined by the nonzero parameters of $\theta_0$. We evaluate this by probabilities or expectations, given $\theta_0$.

28 Benchmarks for recovery sequence model $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. $\|\theta\|_0 = \#(1 \le i \le n: \theta_i \ne 0)$, $\|\theta\|_2^2 = \sum_{i=1}^n \theta_i^2$. Frequentist benchmark: minimax rate relative to $\|\cdot\|_2$ over black bodies $\{\theta: \|\theta\|_0 \le s_n\}$: $\sqrt{s_n \log(n/s_n)}$. [(If $s_n \to \infty$ with $s_n/n \to 0$.) Donoho & Johnstone, Golubev, Johnstone and Silverman, Abramovich et al.,...]

29 Benchmarks for recovery sequence model $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. $\|\theta\|_0 = \#(1 \le i \le n: \theta_i \ne 0)$, $\|\theta\|_q^q = \sum_{i=1}^n |\theta_i|^q$, $0 < q \le 2$. Frequentist benchmarks: minimax rate relative to $\|\cdot\|_q$ over black bodies $\{\theta: \|\theta\|_0 \le s_n\}$: $s_n^{1/q}\sqrt{\log(n/s_n)}$.

30 Benchmarks for recovery sequence model $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. $\|\theta\|_0 = \#(1 \le i \le n: \theta_i \ne 0)$, $\|\theta\|_q^q = \sum_{i=1}^n |\theta_i|^q$, $0 < q \le 2$. Frequentist benchmarks: minimax rate relative to $\|\cdot\|_q$ over: black bodies $\{\theta: \|\theta\|_0 \le s_n\}$: $s_n^{1/q}\sqrt{\log(n/s_n)}$. Weak $\ell_r$-balls $m_r[s_n] := \{\theta: \max_i i\,|\theta|_{[i]}^r \le n(s_n/n)^r\}$ (with $|\theta|_{[i]}$ the $i$-th largest absolute coordinate): $n^{1/q}(s_n/n)^{r/q}\bigl(\log(n/s_n)\bigr)^{(1-r/q)/2}$. [(If $s_n \to \infty$ with $s_n/n \to 0$.) Donoho & Johnstone, Golubev, Johnstone and Silverman, Abramovich et al.,...]

31 Model Selection Prior

32 Model selection prior $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^n$: (1) Choose $s$ from a prior $\pi_n$ on $\{0,1,2,\ldots,n\}$. (2) Choose $S \subset \{1,\ldots,n\}$ of size $|S| = s$ at random. (3) Choose $\theta_S = (\theta_i: i \in S)$ from a density $g_S$ on $\mathbb{R}^S$ and set $\theta_{S^c} = 0$.

33 Model selection prior $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^n$: (1) Choose $s$ from a prior $\pi_n$ on $\{0,1,2,\ldots,n\}$. (2) Choose $S \subset \{1,\ldots,n\}$ of size $|S| = s$ at random. (3) Choose $\theta_S = (\theta_i: i \in S)$ from a density $g_S$ on $\mathbb{R}^S$ and set $\theta_{S^c} = 0$. Assume: $\pi_n(s) \le c\,\pi_n(s-1)$, for some $c < 1$ and every $s$. $g_S = \otimes_{i \in S}\, e^h$, for uniformly Lipschitz $h: \mathbb{R} \to \mathbb{R}$. $s_n := \|\theta_0\|_0 \to \infty$, $s_n/n \to 0$. Examples: complexity prior: $\pi_n(s) \propto e^{-a s \log(b n/s)}$. Spike and slab: $\theta_i$ iid $\sim \tau\delta_0 + (1-\tau)\mathrm{Lap}$ with $\tau \sim B(1, n+1)$.

34 Numbers Single data set with $\theta_0 = (0,\ldots,0,5,\ldots,5)$, $n = 500$ and $\|\theta_0\|_0 = 100$. Red dots: marginal posterior medians. Orange: marginal credible intervals. Green dots: data points. $g$ the standard Laplace density; $\pi_n(k) \propto \binom{2n}{k}^{-0.1}$ (left) and $\pi_n(k) \propto \binom{2n}{k}^{-1}$ (right).

35 Dimensionality of posterior distribution Theorem [black body] There exists $M$ such that $\sup_{\theta_0: \|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: \|\theta\|_0 \ge M s_n \mid Y^n\bigr) \to 0$. Outside the space in which $\theta_0$ lives, the posterior is concentrated in low-dimensional subspaces along the coordinate axes.

36 Recovery Theorem [black body] For every $0 < q \le 2$ and large $M$, $\sup_{\|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: \|\theta - \theta_0\|_q > M r_n s_n^{1/q - 1/2} \mid Y^n\bigr) \to 0$, for $r_n^2 = s_n \log(n/s_n) \vee \log(1/\pi_n(s_n))$. If $\pi_n(s_n) \ge e^{-a s_n \log(n/s_n)}$ the minimax rate is attained.

37 Selection $S_\theta := \{1 \le i \le n: \theta_i \ne 0\}$. Theorem [No supersets] $\sup_{\|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: S_\theta \supset S_{\theta_0}, S_\theta \ne S_{\theta_0} \mid Y^n\bigr) \to 0$.

38 Selection $S_\theta := \{1 \le i \le n: \theta_i \ne 0\}$. Theorem [No supersets] $\sup_{\|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: S_\theta \supset S_{\theta_0}, S_\theta \ne S_{\theta_0} \mid Y^n\bigr) \to 0$. Theorem [Finds big signals] $\inf_{\|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: S_\theta \supset \{i: |\theta_{0,i}| \ge \sqrt{\log n}\} \mid Y^n\bigr) \to 1$.

39 Selection $S_\theta := \{1 \le i \le n: \theta_i \ne 0\}$. Theorem [No supersets] $\sup_{\|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: S_\theta \supset S_{\theta_0}, S_\theta \ne S_{\theta_0} \mid Y^n\bigr) \to 0$. Theorem [Finds big signals] $\inf_{\|\theta_0\|_0 \le s_n} E_{\theta_0}\, \Pi_n\bigl(\theta: S_\theta \supset \{i: |\theta_{0,i}| \ge \sqrt{\log n}\} \mid Y^n\bigr) \to 1$. Corollary: if all nonzero $\theta_{0,i}$ are suitably big, then the posterior probability of the true model $S_{\theta_0}$ tends to 1.

40 Bernstein-von Mises theorem Theorem For the spike-and-Laplace$(\lambda_n)$-slab prior with $\lambda_n \sqrt{\log(n/s_n)} \to 0$, there are random weights $\hat w_S$ such that $E_{\theta_0}\bigl\|\Pi_n(\cdot \mid Y^n) - \sum_S \hat w_S\, N_S(Y^n_S, I) \otimes \delta_{S^c}\bigr\| \to 0$. Theorem Given consistent model selection, the mixture can be replaced by $N_{S_{\theta_0}}(Y^n_{S_{\theta_0}}, I) \otimes \delta_{S^c_{\theta_0}}$. Corollary: Given consistent model selection, credible sets for individual parameters are asymptotic confidence sets.

41 Numbers: mean square errors [Table: average $\|\hat\theta - \theta\|^2$ over 100 data experiments, for various $(n, A)$, for the estimators PM1, PM2, EBM, PMed1, PMed2, EBMed, HT, HTO.] $n = 500$; $\theta_0 = (0,\ldots,0,A,\ldots,A)$. PM1, PM2: posterior means for the priors $\pi_n(k) \propto e^{-k\log(3n/k)/10}$ and $\pi_n(k) \propto \binom{2n}{k}^{-0.1}$. PMed1, PMed2: marginal posterior medians for the same priors. EBM, EBMed: empirical Bayes mean and median for the Laplace prior (Johnstone et al.). HT, HTO: thresholding at $\sqrt{2\log n}$ and $\sqrt{2\log(n/\|\theta_0\|_0)}$. Short summary: the Bayesian method is neither better nor worse.

42 Horseshoe Prior

43 Horseshoe prior $Y^n \sim N_n(\theta, I)$, for $\theta = (\theta_1,\ldots,\theta_n) \in \mathbb{R}^n$. Constructive definition of prior $\Pi$ for $\theta \in \mathbb{R}^n$: (1) Choose a sparsity level $\hat\tau$. (2) Generate $\psi_1,\ldots,\psi_n$ iid from $\mathrm{Cauchy}^+(0,\hat\tau)$. (3) Generate independent $\theta_i \sim N(0, \psi_i)$. [Figures: prior of the shrinkage factor $\kappa$; prior density of $\theta_i$; posterior mean of $\theta_i$ as a function of $Y_i$.] [Carvalho & Polson & Scott, 10.]

44 Estimating $\tau$ Ad-hoc: $\hat\tau_n = \#\{i: |Y^n_i| \ge \sqrt{2\log n}\}/(1.1\,n)$. Empirical Bayes: for $g_\tau$ the prior density of $\theta_i$, $\hat\tau_n = \mathrm{argmax}_{\tau \in [1/n,1]} \prod_{i=1}^n \int \phi(Y_i - \theta)\, g_\tau(\theta)\, d\theta$. Full Bayes: $\tau$ set by a hyperprior (supported on $[1/n,1]$).
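Both estimators are easy to sketch numerically. Below, the ad-hoc count estimator follows the display, while the empirical Bayes estimator maximizes the marginal likelihood over a grid in $[1/n, 1]$; for $g_\tau$ we use the standard horseshoe parametrization $\theta_i \mid \lambda_i \sim N(0, \tau^2\lambda_i^2)$, $\lambda_i \sim \mathrm{Cauchy}^+(0,1)$ (an assumption on our part), with the one-dimensional mixing integral done by quadrature.

```python
import numpy as np
from scipy import integrate, stats

def tau_adhoc(y, c=1.1):
    """Count observations above the universal threshold sqrt(2 log n),
    divide by c*n, and truncate into [1/n, 1]."""
    n = len(y)
    frac = np.sum(np.abs(y) >= np.sqrt(2.0 * np.log(n))) / (c * n)
    return min(max(frac, 1.0 / n), 1.0)

def log_marginal(y_i, tau):
    """Log marginal density of one Y_i = theta_i + noise under g_tau
    (horseshoe parametrization, integrated over lambda by quadrature)."""
    def integrand(lam):
        sd = np.sqrt(1.0 + (tau * lam) ** 2)
        return stats.norm.pdf(y_i, scale=sd) * 2.0 / (np.pi * (1.0 + lam**2))
    val, _ = integrate.quad(integrand, 0.0, np.inf)
    return np.log(val)

def tau_mmle(y, gridsize=50):
    """Empirical Bayes: maximize the marginal likelihood over [1/n, 1]."""
    n = len(y)
    grid = np.geomspace(1.0 / n, 1.0, gridsize)
    loglik = [sum(log_marginal(yi, t) for yi in y) for t in grid]
    return grid[int(np.argmax(loglik))]
```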

45 Numbers estimating $\tau$ [Figure: MSE of the posterior mean as a function of the nonzero mean $A$; $n = 100$, with $s_n$ ($= \sqrt{n}$) coordinates drawn from $N(A,1)$ and the remaining $n - s_n$ coordinates from $N(0, 1/4)$.] Short summary: empirical Bayes and full Bayes outperform the ad-hoc estimator.

46 Recovery The horseshoe prior gives recovery similar to the model selection prior.

47 Recovery The horseshoe prior gives recovery similar to the model selection prior. $\tau$ can be interpreted as $(s_n/n)\sqrt{\log(n/s_n)}$.

48 Credible intervals Credible interval: $\hat C_{ni}(L) = \{\theta_i: |\theta_i - \hat\theta_i| \le L \hat r_i\}$, where $\hat\theta = E(\theta \mid Y^n)$ and $\Pi\bigl(\theta_i: |\theta_i - \hat\theta_i| \le \hat r_i \mid Y^n\bigr) = 0.95$.

49 Credible intervals Credible interval: $\hat C_{ni}(L) = \{\theta_i: |\theta_i - \hat\theta_i| \le L \hat r_i\}$, where $\hat\theta = E(\theta \mid Y^n)$ and $\Pi\bigl(\theta_i: |\theta_i - \hat\theta_i| \le \hat r_i \mid Y^n\bigr) = 0.95$. $S_a := \{1 \le i \le n: |\theta_{0,i}| \le 1/n\}$, $M_a := \{1 \le i \le n: (s_n/n)\sqrt{\log(n/s_n)} \le |\theta_{0,i}| \le \sqrt{\log(n/s_n)}\}$, $L_a := \{1 \le i \le n: \sqrt{\log n} \le |\theta_{0,i}|\}$.

50 Credible intervals Credible interval: $\hat C_{ni}(L) = \{\theta_i: |\theta_i - \hat\theta_i| \le L \hat r_i\}$, where $\hat\theta = E(\theta \mid Y^n)$ and $\Pi\bigl(\theta_i: |\theta_i - \hat\theta_i| \le \hat r_i \mid Y^n\bigr) = 0.95$. $S_a := \{1 \le i \le n: |\theta_{0,i}| \le 1/n\}$, $M_a := \{1 \le i \le n: (s_n/n)\sqrt{\log(n/s_n)} \le |\theta_{0,i}| \le \sqrt{\log(n/s_n)}\}$, $L_a := \{1 \le i \le n: \sqrt{\log n} \le |\theta_{0,i}|\}$. [Figure: marginal credible intervals for a single $Y^n$ with $n = 200$ and $s_n = 10$; $\theta_1 = \cdots = \theta_5 = 7$, $\theta_6 = \cdots = \theta_{10} = 1.5$. Insert: credible sets 5 to 13.]

51 Credible intervals Credible interval: $\hat C_{ni}(L) = \{\theta_i: |\theta_i - \hat\theta_i| \le L \hat r_i\}$, where $\hat\theta = E(\theta \mid Y^n)$ and $\Pi\bigl(\theta_i: |\theta_i - \hat\theta_i| \le \hat r_i \mid Y^n\bigr) = 0.95$. $S_a := \{1 \le i \le n: |\theta_{0,i}| \le 1/n\}$, $M_a := \{1 \le i \le n: (s_n/n)\sqrt{\log(n/s_n)} \le |\theta_{0,i}| \le \sqrt{\log(n/s_n)}\}$, $L_a := \{1 \le i \le n: \sqrt{\log n} \le |\theta_{0,i}|\}$. Theorem For any $\gamma > 0$ and $\|\theta_0\|_0 \le s_n$: $P_{\theta_0}\bigl(\frac{1}{\#S_a}\#\{i \in S_a: \theta_{0,i} \in \hat C_{ni}(L_{S,\gamma})\} \ge 1 - \gamma\bigr) \to 1$; $P_{\theta_0}\bigl(\theta_{0,i} \notin \hat C_{ni}(L)\bigr) \to 1$, for any $L > 0$ and $i \in M_a$; $P_{\theta_0}\bigl(\frac{1}{\#L_a}\#\{i \in L_a: \theta_{0,i} \in \hat C_{ni}(L_{L,\gamma})\} \ge 1 - \gamma\bigr) \to 1$. Few false discoveries; most easy discoveries made. Intermediate discoveries not made.
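Given posterior draws (from any horseshoe sampler), the marginal intervals of the definition above are plain quantile computations; a sketch:

```python
import numpy as np

def marginal_credible_intervals(samples, L=1.0, level=0.95):
    """samples: (B, n) array of posterior draws of theta. Returns the
    intervals hat(theta)_i +/- L * hat(r)_i, where hat(theta) is the
    posterior mean and hat(r)_i gives the centred interval posterior
    mass `level`, as in the definition above."""
    theta_hat = samples.mean(axis=0)
    r_hat = np.quantile(np.abs(samples - theta_hat), level, axis=0)
    return theta_hat - L * r_hat, theta_hat + L * r_hat
```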

52 Simultaneous credible balls: impossibility of adaptation General principle: the size of an honest confidence set is determined by the biggest model.

53 Simultaneous credible balls: impossibility of adaptation General principle: the size of an honest confidence set is determined by the biggest model. Theorem [Li, 1987] If $P_{\theta_0}\bigl(C_n(Y^n) \ni \theta_0\bigr) \ge 0.95$ for all $\theta_0 \in \mathbb{R}^n$, then $\mathrm{diam}\bigl(C_n(Y^n)\bigr) \gtrsim n^{1/4}$, for some $\theta_0$.

54 Simultaneous credible balls: impossibility of adaptation General principle: the size of an honest confidence set is determined by the biggest model. Theorem [Li, 1987] If $P_{\theta_0}\bigl(C_n(Y^n) \ni \theta_0\bigr) \ge 0.95$ for all $\theta_0 \in \mathbb{R}^n$, then $\mathrm{diam}\bigl(C_n(Y^n)\bigr) \gtrsim n^{1/4}$, for some $\theta_0$. Theorem [Nickl, van de Geer, 2013] If $s_{1,n} \ll s_{2,n}$ and $\mathrm{diam}(C_n(Y^n))$ is of optimal size, uniformly in $\|\theta_0\|_0 \le s_{i,n}$ for $i = 1,2$, then $C_n(Y^n)$ cannot have uniform coverage over $\{\theta_0: \|\theta_0\|_0 \le s_{2,n}\}$.

55 Simultaneous credible balls: impossibility of adaptation General principle: the size of an honest confidence set is determined by the biggest model. Theorem [Li, 1987] If $P_{\theta_0}\bigl(C_n(Y^n) \ni \theta_0\bigr) \ge 0.95$ for all $\theta_0 \in \mathbb{R}^n$, then $\mathrm{diam}\bigl(C_n(Y^n)\bigr) \gtrsim n^{1/4}$, for some $\theta_0$. Theorem [Nickl, van de Geer, 2013] If $s_{1,n} \ll s_{2,n}$ and $\mathrm{diam}(C_n(Y^n))$ is of optimal size, uniformly in $\|\theta_0\|_0 \le s_{i,n}$ for $i = 1,2$, then $C_n(Y^n)$ cannot have uniform coverage over $\{\theta_0: \|\theta_0\|_0 \le s_{2,n}\}$. Since the Bayesian procedure adapts to sparsity, its credible sets cannot be honest confidence sets. [Optimal size is $\bigl((s_{i,n}/n)\log(n/s_{i,n})\bigr)^{1/2}$.]

56 Simultaneous credible balls: impossibility of adaptation, restricting the parameter Coverage only when $\theta_0$ does not cause too much shrinkage. DEFINITION [self-similarity] For $s = \|\theta_0\|_0$, at least $0.001\,s$ coordinates of $\theta_0$ satisfy $|\theta_{0,i}| \ge \sqrt{\log(n/s)}$.

57 Simultaneous credible balls: impossibility of adaptation, restricting the parameter Coverage only when $\theta_0$ does not cause too much shrinkage. DEFINITION [self-similarity] For $s = \|\theta_0\|_0$, at least $0.001\,s$ coordinates of $\theta_0$ satisfy $|\theta_{0,i}| \ge \sqrt{\log(n/s)}$. DEFINITION [excessive-bias restriction, Belitser & Nurushev, 2015] There exists $\tilde s$ with $\tilde s \lesssim \#\bigl(i: |\theta_{0,i}| \ge \sqrt{\log(n/\tilde s)}\bigr)$ and $\sum_{i: |\theta_{0,i}| \le \sqrt{\log(n/\tilde s)}} \theta_{0,i}^2 \lesssim \tilde s \log(n/\tilde s)$. Self-similarity implies the excessive-bias restriction. (Self-similarity allows to tighten up the sets $S$, $M$, $L$.)

58 Simultaneous credible balls Credible ball: $\hat C_n(L) = \{\theta: \|\theta - \hat\theta\| \le L \hat r\}$, where $\hat\theta = E(\theta \mid Y^n)$ and $\Pi\bigl(\theta: \|\theta - \hat\theta\| \le \hat r \mid Y^n\bigr) = 0.95$.

59 Simultaneous credible balls Credible ball: $\hat C_n(L) = \{\theta: \|\theta - \hat\theta\| \le L \hat r\}$, where $\hat\theta = E(\theta \mid Y^n)$ and $\Pi\bigl(\theta: \|\theta - \hat\theta\| \le \hat r \mid Y^n\bigr) = 0.95$. Theorem If $s_n/n \to 0$, then for sufficiently large $L$, $\liminf_n \inf_{\theta_0 \in \mathrm{EBR}[s_n]} P_{\theta_0}\bigl(\theta_0 \in \hat C_n(L)\bigr) \ge 1 - \alpha$. $\mathrm{EBR}[s]$: vectors $\theta_0$ that satisfy the excessive-bias restriction.
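The corresponding credible ball is computed the same way as the marginal intervals, with the radius now a quantile of the $\ell_2$ distance to the posterior mean; a sketch under the same sampling assumptions:

```python
import numpy as np

def credible_ball(samples, L=1.0, level=0.95):
    """Centre and blown-up radius of the credible ball
    {theta: ||theta - hat(theta)|| <= L * hat(r)}, where hat(r) is the
    `level`-quantile of ||theta - hat(theta)||_2 over posterior draws."""
    theta_hat = samples.mean(axis=0)
    dists = np.linalg.norm(samples - theta_hat, axis=1)
    return theta_hat, L * np.quantile(dists, level)
```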

60 Numbers [Figures: coverage and average interval length; $n = 400$, with $s_n$ ($= \sqrt{n}$) nonzero means drawn from $N(A,1)$.] Short summary: empirical and full Bayes work well.

61 Conclusions Bayesian sparse estimation gives excellent recovery. For valid simultaneous credible sets one needs a fraction of the nonzero parameters above the universal threshold. The danger of failing uncertainty quantification is not finding nonzero coordinates; the discoveries that are made are real.

62 Co-authors Subhashis Ghoshal Harry van Zanten Ismael Castillo Johannes Schmidt-Hieber Stéphanie van der Pas Botond Szabo Willem Kruijer Magnus Münch Bartek Knapik Bas Kleijn Suzanne Sniekers Fengnan Gao Gwenael Leday Gino Kpogbezan Mark van de Wiel Wessel van Wieringen

63 Compatibility and coherence $\|X\| := \max_j \|X_{\cdot,j}\|$. The compatibility number $\varphi(S)$ for $S \subset \{1,\ldots,p\}$ is: $\varphi(S) = \inf_{\theta: \|\theta_{S^c}\|_1 \le 7\|\theta_S\|_1} \frac{\|X\theta\|_2 \sqrt{|S|}}{\|X\|\,\|\theta_S\|_1}$. Compatibility in $s_n$-sparse vectors means: $\inf_{\theta: \|\theta\|_0 \le 5 s_n} \frac{\|X\theta\|_2 \sqrt{|S_\theta|}}{\|X\|\,\|\theta\|_1} \gg 0$. Strong compatibility in $s_n$-sparse vectors means: $\inf_{\theta: \|\theta\|_0 \le 5 s_n} \frac{\|X\theta\|_2}{\|X\|\,\|\theta\|_2} \gg 0$. Mutual coherence means: $s_n \max_{i \ne j} |\mathrm{cor}(X_{\cdot i}, X_{\cdot j})| \ll 1$.
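Of these conditions, mutual coherence is the simplest to check numerically: it involves only the maximal column correlation of the design. A sketch:

```python
import numpy as np

def mutual_coherence_number(X):
    """max_{i != j} |cor(X_.i, X_.j)|; the design satisfies the mutual
    coherence condition when s_n times this number is small."""
    C = np.corrcoef(X, rowvar=False)   # p x p matrix of column correlations
    np.fill_diagonal(C, 0.0)
    return np.max(np.abs(C))
```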

64 Compatibility and coherence examples Mutual coherence $\Rightarrow$ strong compatibility $\Rightarrow$ compatibility. Mutual coherence is easy to understand and gives the best recovery results, but is very restrictive.

65 Compatibility and coherence examples Mutual coherence $\Rightarrow$ strong compatibility $\Rightarrow$ compatibility. Mutual coherence is easy to understand and gives the best recovery results, but is very restrictive. Sequence model: completely compatible, with zero mutual coherence number.

66 Compatibility and coherence examples Mutual coherence $\Rightarrow$ strong compatibility $\Rightarrow$ compatibility. Mutual coherence is easy to understand and gives the best recovery results, but is very restrictive. Sequence model: completely compatible, with zero mutual coherence number. Response model: if the $X_{i,j}$ are i.i.d. random variables, then coherence holds if $s_n \ll \sqrt{n/\log p}$: if $\log p = o(n)$ and the $X_{i,j}$ are bounded; or if $\log p = o(n^{\alpha/(4+\alpha)})$ and $E e^{t|X_{i,j}|^\alpha} < \infty$.

67 Compatibility and coherence examples Mutual coherence $\Rightarrow$ strong compatibility $\Rightarrow$ compatibility. Mutual coherence is easy to understand and gives the best recovery results, but is very restrictive. Sequence model: completely compatible, with zero mutual coherence number. Response model: if the $X_{i,j}$ are i.i.d. random variables, then coherence holds if $s_n \ll \sqrt{n/\log p}$: if $\log p = o(n)$ and the $X_{i,j}$ are bounded; or if $\log p = o(n^{\alpha/(4+\alpha)})$ and $E e^{t|X_{i,j}|^\alpha} < \infty$. $C = X^T X/n$: compatibility, but no coherence, if $C_{i,j} = \rho^{|i-j|}$ for $0 < \rho < 1$ and $p = n$, or if $C$ is block diagonal with fixed block sizes.
